Abstract


Under certain conditions (known as the Restricted Isometry Property or RIP) on the $m \times N$ matrix $\Phi$ (where $m < N$), vectors $x \in \mathbb{R}^N$ that are sparse (i.e. have most of their entries equal to zero) can be recovered exactly from $y := \Phi x$ even though $\Phi^{-1}(y)$ is typically an $(N - m)$-dimensional hyperplane; in addition $x$ is then equal to the element in $\Phi^{-1}(y)$ of minimal $\ell_1$-norm. This minimal element can be identified via linear programming algorithms.

We study an alternative method of determining $x$, as the limit of an Iteratively Re-weighted Least Squares (IRLS) algorithm. The main step of this IRLS finds, for a given weight vector $w$, the element in $\Phi^{-1}(y)$ with smallest $\ell_2(w)$-norm. If $x^{(n)}$ is the solution at iteration step $n$, then the new weight $w^{(n)}$ is defined by $w_i^{(n)} := \big[|x_i^{(n)}|^2 + \epsilon_n^2\big]^{-1/2}$, $i = 1, \dots, N$, for a decreasing sequence of adaptively defined $\epsilon_n$; this updated weight is then used to obtain $x^{(n+1)}$ and the process is repeated. We prove that when $\Phi$ satisfies the RIP conditions, the sequence $x^{(n)}$ converges for all $y$, regardless of whether $\Phi^{-1}(y)$ contains a sparse vector. If there is a sparse vector in $\Phi^{-1}(y)$, then the limit is this sparse vector, and when $x^{(n)}$ is sufficiently close to the limit, the remaining steps of the algorithm converge exponentially fast (linear convergence in the terminology of numerical optimization). The same algorithm with the "heavier" weight $w_i^{(n)} = \big[|x_i^{(n)}|^2 + \epsilon_n^2\big]^{-1+\tau/2}$, $i = 1, \dots, N$, where $0 < \tau < 1$, can recover sparse solutions as well; more importantly, we show its local convergence is superlinear and approaches a quadratic rate as $\tau$ approaches zero.


1  Introduction 

Let $\Phi$ be an $m \times N$ matrix with $m < N$ and let $y \in \mathbb{R}^m$. (In the compressed sensing application that motivated this study, $\Phi$ typically has full rank, i.e. $\operatorname{Ran}(\Phi) = \mathbb{R}^m$. We shall implicitly assume, throughout the paper, that this is the case. Our results still hold for the case where $\operatorname{Ran}(\Phi) \subsetneq \mathbb{R}^m$, with the proviso that $y$ must then lie in $\operatorname{Ran}(\Phi)$.)

The linear system of equations

$$\Phi x = y \tag{1.1}$$

is underdetermined, and has infinitely many solutions. If $\mathcal{N} := \mathcal{N}(\Phi)$ is the null space of $\Phi$ and $x_0$ is any solution to (1.1), then the set $\mathcal{F}(y) := \Phi^{-1}(y)$ of all solutions to (1.1) is given by $\mathcal{F}(y) = x_0 + \mathcal{N}$.

In the absence of any other information, no solution to (1.1) is to be preferred over any other. However, many scientific applications work under the assumption that the desired solution $x \in \mathcal{F}(y)$ is either sparse or well approximated by (a) sparse vector(s). Here and later, we say a vector has sparsity $k$ (or is $k$-sparse) if it has at most $k$ nonzero coordinates. Suppose then that we know that the desired solution of (1.1) is $k$-sparse, where $k < m$ is known. How could we find such an $x$? One possibility is to consider any set $T$ of $k$ column indices and find the least squares solution $x^T := \operatorname{argmin}_{z \in \mathcal{F}(y)} \|\Phi_T z - y\|_{\ell_2^m}$, where $\Phi_T$ is obtained from $\Phi$ by setting to zero all entries that are not in columns from $T$. Finding $x^T$ is numerically simple (see (1.9)). After finding each $x^T$, we choose the particular set $T^*$ that minimizes the residual $\|\Phi_T z - y\|_{\ell_2^m}$. This would find a $k$-sparse solution (if it exists), $x^* = x^{T^*}$. However, this naive method is numerically prohibitive when $N$ and $k$ are large, since it requires solving $\binom{N}{k}$ least squares problems.
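To get a feeling for how quickly the exhaustive search becomes infeasible, the following sketch simply counts the least squares problems it would have to solve. The problem sizes are illustrative assumptions and are not taken from the paper.

```python
import math


def num_least_squares_problems(N: int, k: int) -> int:
    """Number of index sets T with k elements among N columns, i.e. binom(N, k)."""
    return math.comb(N, k)


# Illustrative sizes (assumed): the count grows combinatorially in N and k.
for N, k in [(20, 3), (100, 5), (1000, 10)]:
    print(f"N={N:5d}  k={k:3d}  problems={num_least_squares_problems(N, k)}")
```

Already for a thousand columns and ten nonzero entries the count is far beyond what can be enumerated, which is the motivation for the convex relaxation discussed next.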

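An attractive alternative to the naive minimization is its convex relaxation that consists in selecting the element in $\mathcal{F}(y)$ which has minimal $\ell_1$-norm:

$$x := \operatorname*{argmin}_{z \in \mathcal{F}(y)} \|z\|_{\ell_1^N}. \tag{1.2}$$

Here and later we use the $\ell_p$-norms

$$\|x\|_{\ell_p} := \|x\|_{\ell_p^N} := \begin{cases} \Big(\sum_{j=1}^N |x_j|^p\Big)^{1/p}, & 0 < p < \infty, \\ \max_{j=1,\dots,N} |x_j|, & p = \infty. \end{cases} \tag{1.3}$$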

Under certain assumptions on $\Phi$ and $y$ that we shall describe in §2, it is known that (1.2) has a unique solution (which we shall denote by $x^*$), and that, when there is a $k$-sparse solution to (1.1), (1.2) will find this solution [3, 7, 20, 21]. Because the problem (1.2) can be formulated as a linear program, it is numerically tractable.
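For concreteness, one standard textbook reformulation of (1.2) as a linear program is sketched below. It is not spelled out in the text above; the auxiliary variables $t_j$, which bound $|z_j|$ from above, are introduced here only for illustration.

```latex
\begin{aligned}
\min_{z,\, t \,\in\, \mathbb{R}^N} \quad & \sum_{j=1}^{N} t_j \\
\text{subject to} \quad & \Phi z = y, \\
& -t_j \le z_j \le t_j, \qquad j = 1, \dots, N .
\end{aligned}
```

At any optimum one has $t_j = |z_j|$ for every $j$, so the optimal value coincides with the minimal $\ell_1$-norm over the solution set of $\Phi z = y$.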

Solving underdetermined systems by $\ell_1$-minimization has a long history. It is at the heart of many numerical algorithms for approximation, compression, and statistical estimation. The use of the $\ell_1$-norm as a sparsity-promoting functional can be found first in reflection seismology and in deconvolution of seismic traces [16, 37, 38]. Rigorous results for $\ell_1$-minimization began to appear in the late 1980s, with Donoho and Stark [23] and Donoho and Logan [22]. Applications for $\ell_1$-minimization in statistical estimation began in the mid-1990s with the introduction of the LASSO and related formulations [39] (iterative soft-thresholding), also known as Basis Pursuit [15], proposed in compression applications for extracting the sparsest signal representation from highly overcomplete frames. Around the same time other signal processing groups started using $\ell_1$-minimization for the analysis of sparse signals; see, e.g. [32]. The applications and understanding of $\ell_1$-minimization saw a dramatic increase in the last 5 years [20, 24, 21, 25, 7, 4, 3, 6], with the development of fairly general mathematical frameworks in which $\ell_1$-minimization, known heuristically to be sparsity-promoting, can be proved to recover sparse solutions exactly. We shall not trace all the relevant results and applications; a detailed history is beyond the scope of this introduction. We refer the reader to the survey papers [5, 1]. The reader can also find a comprehensive collection of the ongoing recent developments at the web-site http://www.dsp.ece.rice.edu/cs/. In fact, $\ell_1$-minimization has been so surprisingly effective in several applications, that Candès, Wakin, and Boyd call it the "modern least squares" in [8]. We thus clearly need efficient algorithms for the minimization problem (1.2).

Several alternatives to (1.2), see, e.g., [26, 31], have been proposed as possibly more efficient numerically, or simpler to implement by non-experts, than standard algorithms for linear programming (such as interior point or barrier methods). In this paper we clarify fine convergence properties of one such alternative method, called Iteratively Re-weighted Least Squares minimization (IRLS). It begins with the following observation (see §2 for details). If (1.2) has a solution $x^*$ that has no vanishing coordinates, then the (unique!) solution $x^w$ of the weighted least squares problem

$$x^w := \operatorname*{argmin}_{z \in \mathcal{F}(y)} \|z\|_{\ell_2^N(w)}, \qquad w := (w_1, \dots, w_N), \quad \text{where } w_j := |x_j^*|^{-1}, \tag{1.4}$$

coincides with $x^*$. (The following argument provides a short proof by contradiction of this statement. Assume that $x^*$ is not the $\ell_2^N(w)$-minimizer. Then there exists $\eta \in \mathcal{N}$ such that $\|x^* + \eta\|_{\ell_2^N(w)}^2 < \|x^*\|_{\ell_2^N(w)}^2$, or equivalently $\frac{1}{2}\|\eta\|_{\ell_2^N(w)}^2 < -\sum_{j=1}^N w_j \eta_j x_j^* = -\sum_{j=1}^N \eta_j \operatorname{sign}(x_j^*)$. However, because $x^*$ is an $\ell_1$-minimizer, we have $\|x^*\|_{\ell_1} \le \|x^* + h\eta\|_{\ell_1}$ for all $h \neq 0$; taking $h$ sufficiently small, this implies $\sum_{j=1}^N \eta_j \operatorname{sign}(x_j^*) = 0$, a contradiction.)

Since we do not know $x^*$, this observation cannot be used directly. However, it leads to the following paradigm for finding $x^*$. We choose a starting weight $w^0$ and solve (1.4) for this weight. We then use this solution to define a new weight $w^1$ and repeat this process. An IRLS algorithm of this type appears for the first time in the approximation practice in the Ph.D. thesis of Lawson in 1961 [30], in the form of an algorithm for solving uniform approximation problems, in particular by Chebyshev polynomials, by means of limits of weighted $\ell_p$-norm solutions. This iterative algorithm is now well-known in classical approximation theory as Lawson's algorithm. In [17] it is proved that this algorithm has in principle a linear convergence rate. In the 1970s extensions of Lawson's algorithm for $\ell_p$-minimization, and in particular $\ell_1$-minimization, were proposed. In signal analysis, IRLS was proposed as a technique to build algorithms for sparse signal reconstruction in [28]. Perhaps the most comprehensive mathematical analysis of the performance of IRLS for $\ell_p$-minimization was given in the work of Osborne [33].

Osborne proves that a suitable IRLS method is convergent for $1 < p < 3$. For $p = 1$, if $w^n$ denotes the weight at the $n$th iteration and $x^n$ the minimal weighted least squares solution for this weight, then the algorithm considered by Osborne defines the new weight $w^{n+1}$ coordinatewise as $w_j^{n+1} := |x_j^n|^{-1}$. His main conclusion in this case is that if the $\ell_1$-minimization problem (1.2) has a unique solution, then the algorithm converges to this solution, in principle with linear convergence rate, i.e. exponentially fast, with a constant "contraction factor".

However, the analysis of Osborne does not take into consideration what happens if one of the coordinates vanishes at some iteration $n$, i.e. $x_j^n = 0$. Taking this to impose that the corresponding weight component $w_j^{n+1}$ must "equal" $\infty$ leads to $x_j^{n+1} = 0$ at the next iteration as well; this then persists in all later iterations. If $x_j^* = 0$, all is well, but if there is an index $j$ for which $x_j^* \neq 0$, yet $x_j^n = 0$ at some iteration step $n$, then this "infinite weight" prescription leads to problems. In practice, this is avoided by changing the definition of the weight at coordinates $j$ where $x_j^n = 0$ (see [31] and [10, 27], where a variant for total variation minimization is studied); such modified algorithms need no longer converge to $x^*$, however. Because Osborne's convergence proof is local, it implies that if the iterations begin with a vector sufficiently close to the solution, and if the solution is unique and has only nonzero entries, then none of the $x_j^n$ vanish, and the weight change is not required; Osborne's analysis does indeed show the linear convergence rate of the algorithm under these assumptions. Unfortunately, as we will see in Remark 2.2, the uniqueness of the solution necessarily implies that it has vanishing components. In other words, the set of vectors to which Osborne's analysis applies is vacuous.

The purpose of the present paper is to put forward an IRLS algorithm that gives a re-weighting without infinite components in the weight, and to provide an analysis of this algorithm, with various results about its convergence and rate of convergence. It turns out that care must be taken in just how the new weight $w^{n+1}$ is derived from the solution $x^n$ of the current weighted least squares problem. To manage this difficulty, we shall consider a very specific recipe for generating the weights. Other recipes are certainly possible.

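Given a real number $\epsilon > 0$ and a weight vector $w \in \mathbb{R}^N$, with $w_j > 0$, $j = 1, \dots, N$, we define

$$J(z, w, \epsilon) := \frac{1}{2}\left[\sum_{j=1}^N z_j^2 w_j + \sum_{j=1}^N \big(\epsilon^2 w_j + w_j^{-1}\big)\right], \qquad z \in \mathbb{R}^N. \tag{1.5}$$

Given $w$ and $\epsilon$, the element $z \in \mathbb{R}^N$ that minimizes $J$ is unique because $J$ is strictly convex.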

Our algorithm will use an alternating method for choosing minimizers and weights based on the functional $J$. To describe this, we define for $z \in \mathbb{R}^N$ the non-increasing rearrangement $r(z)$ of the absolute values of the entries of $z$. Thus $r(z)_i$ is the $i$-th largest element of the set $\{|z_j|,\ j = 1, \dots, N\}$, and a vector $v$ is $k$-sparse iff $r(v)_{k+1} = 0$.
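The rearrangement and the sparsity test just described can be sketched in a few lines. The function names `rearrangement` and `is_k_sparse` are hypothetical and chosen only for this illustration; the code uses 0-based indexing, so `r[k]` corresponds to $r(v)_{k+1}$.

```python
def rearrangement(z):
    """Non-increasing rearrangement r(z) of the absolute values of the entries of z."""
    return sorted((abs(v) for v in z), reverse=True)


def is_k_sparse(v, k):
    """v is k-sparse iff r(v)_{k+1} = 0, i.e. v has at most k nonzero entries."""
    r = rearrangement(v)
    # With 0-based lists, r[k] is r(v)_{k+1}; if k >= N the condition is vacuous.
    return k >= len(r) or r[k] == 0


z = [0.5, -3.0, 0.0, 2.0, 0.0]
print(rearrangement(z))   # the absolute values in non-increasing order
print(is_k_sparse(z, 3))  # z has three nonzero entries
print(is_k_sparse(z, 2))
```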


Algorithm 1 We initialize by taking $w^0 := (1, \dots, 1)$. We also set $\epsilon_0 := 1$. We then recursively define for $n = 0, 1, \dots$,

$$x^{n+1} := \operatorname*{argmin}_{z \in \mathcal{F}(y)} J(z, w^n, \epsilon_n) = \operatorname*{argmin}_{z \in \mathcal{F}(y)} \|z\|_{\ell_2(w^n)} \tag{1.6}$$

and

$$\epsilon_{n+1} := \min\left(\epsilon_n, \frac{r(x^{n+1})_{K+1}}{N}\right), \tag{1.7}$$

where $K$ is a fixed integer that will be described more fully later. We also define

$$w^{n+1} := \operatorname*{argmin}_{w > 0} J(x^{n+1}, w, \epsilon_{n+1}). \tag{1.8}$$

We stop the algorithm if $\epsilon_n = 0$; in this case we define $x^j := x^n$ for $j > n$. However, in general, the algorithm will generate an infinite sequence $(x^n)_{n \in \mathbb{N}}$ of distinct vectors. □

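Each step of the algorithm requires the solution of a least squares problem. In matrix form

$$x^{n+1} = D_n \Phi^t (\Phi D_n \Phi^t)^{-1} y, \tag{1.9}$$

where $D_n$ is the $N \times N$ diagonal matrix whose $j$-th diagonal entry is $1/w_j^n$ and $A^t$ denotes the transpose of the matrix $A$. (The inverse weights appear because the stationarity condition for minimizing $\sum_j w_j^n z_j^2$ subject to $\Phi z = y$ reads $w_j^n z_j = (\Phi^t \lambda)_j$ for a multiplier $\lambda \in \mathbb{R}^m$.) Once $x^{n+1}$ is found, the weight $w^{n+1}$ is given by

$$w_j^{n+1} = \big[(x_j^{n+1})^2 + \epsilon_{n+1}^2\big]^{-1/2}, \qquad j = 1, \dots, N. \tag{1.10}$$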
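The iteration (1.6) to (1.10) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper `solve_linear`, the stopping tolerance `tol`, the iteration cap `max_iter`, and the toy system at the bottom are all introduced here for the example. The diagonal of $D_n$ is taken to be $1/w_j^n$, which is what the weighted least squares stationarity conditions give.

```python
def solve_linear(A, b):
    """Solve A u = b by Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    u = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * u[k] for k in range(r + 1, n))
        u[r] = (M[r][n] - s) / M[r][r]
    return u


def irls(Phi, y, K, max_iter=200, tol=1e-8):
    """Sketch of Algorithm 1; stops early once eps drops below tol (an assumption)."""
    m, N = len(Phi), len(Phi[0])
    w = [1.0] * N   # w^0 := (1, ..., 1)
    eps = 1.0       # eps_0 := 1
    x = [0.0] * N
    for _ in range(max_iter):
        d = [1.0 / wj for wj in w]  # diagonal of D_n
        # G = Phi D_n Phi^t, an m x m matrix
        G = [[sum(Phi[a][j] * d[j] * Phi[b][j] for j in range(N))
              for b in range(m)] for a in range(m)]
        lam = solve_linear(G, y)
        # (1.9): x^{n+1} = D_n Phi^t (Phi D_n Phi^t)^{-1} y
        x = [d[j] * sum(Phi[a][j] * lam[a] for a in range(m)) for j in range(N)]
        r = sorted((abs(v) for v in x), reverse=True)
        eps = min(eps, r[K] / N)  # (1.7); r[K] is r(x)_{K+1} in 0-based indexing
        if eps < tol:
            break
        w = [(xj * xj + eps * eps) ** -0.5 for xj in x]  # (1.10)
    return x, eps


# Toy system (assumed): y = Phi x for the 1-sparse vector x = (0, 0, 1).
Phi = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
y = [1.0, 1.0]
x, eps = irls(Phi, y, K=1)
print(x, eps)
```

The early exit matters in floating point: as $\epsilon_n$ tends to zero with a sparse limit, the matrix $\Phi D_n \Phi^t$ approaches a singular matrix, so a practical implementation has to stop (or regularize) before the linear solve degrades.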

We shall prove several results about the convergence and rate of convergence of this algorithm. This will be done under the following assumption on $\Phi$.

The Restricted Isometry Property (RIP): We say that the matrix $\Phi$ satisfies the Restricted Isometry Property of order $L$ with constant $\delta \in (0, 1)$ if for each vector $z$ with sparsity $L$ we have

$$(1 - \delta)\|z\|_{\ell_2^N} \le \|\Phi z\|_{\ell_2^m} \le (1 + \delta)\|z\|_{\ell_2^N}. \tag{1.11}$$

The RIP was introduced by Candès and Tao [7, 4] in their study of compressed sensing and $\ell_1$-minimization. It has several analytical and geometrical interpretations that will be discussed in §3. To mention just one of these results (see [18]), it is known that if $\Phi$ has the RIP of order $L := J + J'$, with $\delta < \frac{\sqrt{J'} - \sqrt{J}}{\sqrt{J'} + \sqrt{J}}$ (here $J' > J$) and if (1.1) has a $J$-sparse solution $z \in \mathcal{F}(y)$, then this solution is the unique $\ell_1$-minimizer in $\mathcal{F}(y)$. (This can still be sharpened: in [9], Candès showed that if $\mathcal{F}(y)$ contains a $J$-sparse vector, and if $\Phi$ has RIP of order $2J$ with $\delta < \sqrt{2} - 1$, then that $J$-sparse vector is unique and is the unique $\ell_1$-minimizer in $\mathcal{F}(y)$.)

The main result of this paper (Theorem 5.3) is that whenever $\Phi$ satisfies the RIP of order $K + K'$ (for some $K' > K$) with $\delta$ sufficiently close to zero, then Algorithm 1 converges to a solution $\bar{x}$ of (1.1) for each $y \in \mathbb{R}^m$. Moreover, if there is a solution $z$ to (1.1) that has sparsity $k \le K - \kappa$, then $\bar{x} = z$. Here $\kappa > 1$ depends on the RIP constant $\delta$ and can be made arbitrarily close to 1 when $\delta$ is made small. The result cited in our previous paragraph implies that in this case $\bar{x} = x^*$, where $x^*$ is the $\ell_1$-minimal solution to (1.1).

A second part of our analysis concerns rates of convergence. We shall show that if (1.1) has a $k$-sparse solution with, e.g., $k \le K - 4$ and if $\Phi$ satisfies the RIP of order $3K$ with $\delta$ sufficiently close to zero, then Algorithm 1 converges exponentially fast to $\bar{x} = x^*$. Namely, once $x^{n_0}$ is sufficiently close to its limit $\bar{x}$, we have

$$\|\bar{x} - x^{n+1}\|_{\ell_1^N} \le \mu \|\bar{x} - x^n\|_{\ell_1^N}, \qquad n \ge n_0, \tag{1.12}$$

where $\mu < 1$ is a fixed constant (depending on $\delta$). From this result it follows that we have exponential convergence to $\bar{x}$ whenever $\bar{x}$ is $k$-sparse; however we have no real information on how long it will take before the iterates enter the region where we can control $\mu$. (Note that this is similar to convergence results for the interior point algorithms that can be used for direct $\ell_1$-minimization.)
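To make the notion of a contraction factor in (1.12) concrete, the following sketch computes successive error ratios for two synthetic error sequences. Both sequences are assumptions made up for illustration: one decays linearly with factor 0.5, the other quadratically; neither comes from an actual run of Algorithm 1.

```python
def contraction_ratios(errors):
    """Successive ratios e_{n+1} / e_n of a sequence of positive errors."""
    return [b / a for a, b in zip(errors, errors[1:])]


# Synthetic error sequences (assumed for illustration only).
linear = [0.5 ** n for n in range(1, 8)]             # e_{n+1} = 0.5 * e_n
quadratic = [0.5 ** (2 ** n) for n in range(1, 6)]   # e_{n+1} = e_n ** 2

print(contraction_ratios(linear))     # constant ratios: linear convergence
print(contraction_ratios(quadratic))  # ratios tending to zero: superlinear convergence
```

A bound of the form (1.12) corresponds to the first pattern, with ratios staying below a fixed $\mu < 1$; ratios that tend to zero indicate superlinear convergence.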
The potential of IRLS algorithms, tailored to mimic $\ell_1$-minimization and so recover sparse solutions, has recently been investigated numerically by Chartrand and several co-authors [11, 12, 14]. Our work provides proofs of several findings listed in these works.

One of the virtues of our approach is that, with minor technical modifications, it allows a similar detailed analysis of IRLS algorithms with weights that promote the non-convex optimization of $\ell_\tau$-norms for $0 < \tau < 1$. We can show not only that these algorithms can again recover sparse solutions, but also that their local rate of convergence is superlinear and tends to quadratic when $\tau$ tends to zero. Thus we also justify theoretically the recent numerical results by Chartrand et al. concerning such non-convex $\ell_\tau$-norm optimization [11, 12, 13, 36].

An outline of our paper is the following. In the next section we make some remarks about $\ell_1$- and weighted $\ell_2$-minimization, upon which we shall call in our proofs. In the following section, we recall the Restricted Isometry Property and the Null Space Property, including some of its consequences that are important to our analysis. In section 4, we gather some preliminary results we shall need to prove our main convergence result, Theorem 5.3, which is formulated and proved in section 5. We then turn to the issue of rate of convergence in section 6. In section 7 we generalize the convergence results obtained for $\ell_1$-minimization to the case of $\ell_\tau$-spaces for $0 < \tau < 1$; in particular, we show, with Theorem 7.9, the local superlinear convergence of the IRLS algorithm in this setting. We conclude the paper with a short section dedicated to a few numerical examples that dovetail nicely with the theoretical results.

2 Characterization of $\ell_1$- and weighted $\ell_2$-minimizers

We fix $y \in \mathbb{R}^m$ and consider the underdetermined system $\Phi x = y$. Given a norm $\|\cdot\|$, the problem of minimizing $\|z\|$ over $z \in \mathcal{F}(y)$ can be viewed as a problem of approximation. Namely, for any $x_0 \in \mathcal{F}(y)$, we can characterize the minimizers in $\mathcal{F}(y)$ as exactly those elements $z \in \mathcal{F}(y)$ that can be written as $z = x_0 + \eta$, with $\eta$ a best approximation to $-x_0$ from $\mathcal{N}$. In this way one can characterize minimizers $z$ from classical results on best approximation in normed spaces. We consider two examples of this in the present section, corresponding to the $\ell_1$-norm and the weighted $\ell_2(w)$-norm.

Throughout this paper, we shall denote by $x$ any element from $\mathcal{F}(y)$ that has smallest $\ell_1$-norm, as in (1.2). When $x$ is unique, we shall emphasize this by denoting it by $x^*$. In general, $x$ and $x^*$ need not be sparse, although we will often consider cases where they are. We begin with the following well-known lemma (see for example Pinkus [34]) which characterizes the minimal $\ell_1$-norm elements from $\mathcal{F}(y)$.

Lemma 2.1 An element $x \in \mathcal{F}(y)$ has minimal $\ell_1$-norm among all elements $z \in \mathcal{F}(y)$ if and only if

$$\Big|\sum_{x_i \neq 0} \operatorname{sign}(x_i)\,\eta_i\Big| \le \sum_{x_i = 0} |\eta_i|, \qquad \forall \eta \in \mathcal{N}. \tag{2.1}$$

Moreover, $x$ is unique if and only if we have strict inequality in (2.1) for all $\eta \in \mathcal{N}$ which are not identically zero.

Proof: We give the simple proof for completeness of this paper. If $x \in \mathcal{F}(y)$ has minimum $\ell_1$-norm, then we have, for any $\eta \in \mathcal{N}$ and any $t \in \mathbb{R}$,

$$\sum_{i=1}^N |x_i + t\eta_i| \ge \sum_{i=1}^N |x_i|. \tag{2.2}$$

Fix $\eta \in \mathcal{N}$. If $t$ is sufficiently small then $x_i + t\eta_i$ and $x_i$ will have the same sign $s_i := \operatorname{sign}(x_i)$ whenever $x_i \neq 0$. Hence, (2.2) can be written as

$$t \sum_{x_i \neq 0} s_i \eta_i + \sum_{x_i = 0} |t\eta_i| \ge 0.$$

Choosing $t$ of an appropriate sign, we see that (2.1) is a necessary condition.

For the opposite direction, we note that if (2.1) holds then for each $\eta \in \mathcal{N}$, we have

$$\begin{aligned}
\sum_{i=1}^N |x_i| = \sum_{x_i \neq 0} s_i x_i &= \sum_{x_i \neq 0} s_i (x_i + \eta_i) - \sum_{x_i \neq 0} s_i \eta_i \\
&\le \sum_{x_i \neq 0} s_i (x_i + \eta_i) + \sum_{x_i = 0} |\eta_i| \le \sum_{i=1}^N |x_i + \eta_i|,
\end{aligned} \tag{2.3}$$

where the first inequality uses (2.1).

If $x$ is unique then we have strict inequality in (2.2) and hence subsequently in (2.1). If we have strict inequality in (2.1) then the subsequent strict inequality in (2.3) implies uniqueness. ■
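Condition (2.1) is easy to evaluate numerically when the null space is one-dimensional, because the condition is homogeneous in $\eta$ and it then suffices to test a single spanning vector. The sketch below does this for an assumed toy system $\Phi = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{pmatrix}$, $y = (1, 1)$, whose null space is spanned by $\eta = (1, 1, -1)$; the function name `l1_optimality` is hypothetical.

```python
def sign(v):
    """Sign of a real number as -1, 0 or 1."""
    return (v > 0) - (v < 0)


def l1_optimality(x, eta):
    """Both sides of (2.1) for one null-space vector eta.

    Returns (lhs, rhs) with lhs = |sum over x_i != 0 of sign(x_i) * eta_i|
    and rhs = sum over x_i == 0 of |eta_i|.  For a one-dimensional null space,
    lhs <= rhs characterizes l1-minimality and lhs < rhs gives uniqueness.
    """
    lhs = abs(sum(sign(xi) * ei for xi, ei in zip(x, eta) if xi != 0))
    rhs = sum(abs(ei) for xi, ei in zip(x, eta) if xi == 0)
    return lhs, rhs


eta = [1.0, 1.0, -1.0]                         # spans the null space of the toy Phi
print(l1_optimality([0.0, 0.0, 1.0], eta))     # strict inequality: unique minimizer
print(l1_optimality([1 / 3, 1 / 3, 2 / 3], eta))  # condition fails: not a minimizer
```

For null spaces of higher dimension a single test vector is no longer enough, since (2.1) must hold for every $\eta \in \mathcal{N}$.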


Remark 2.2 Applying Lemma 2.1 to the special case of $\ell_1$-minimizers with no vanishing entries, we see that a vector $x \in \mathcal{F}(y)$ with $x_i \neq 0$ for all $i = 1, \dots, N$, is a minimal $\ell_1$-norm solution if and only if

$$\sum_{i=1}^N s_i \eta_i = 0, \qquad \text{for all } \eta \in \mathcal{N}. \tag{2.4}$$

This implies that a minimal $\ell_1$-norm solution to $\Phi x = y$ for which all entries are non-vanishing is necessarily non-unique, by the following argument. Suppose that $x_i \neq 0$ for all $i = 1, \dots, N$ and that $x \in \mathcal{F}(y)$ is a minimal $\ell_1$-norm solution. Pick now any $\eta \in \mathcal{N}$, $\eta \neq 0$, and pick $t > 0$ so that $t < \min_{\eta_i \neq 0} |x_i|/|\eta_i|$; it then follows that $s_i = \operatorname{sign}(x_i + t\eta_i)$ for all $i = 1, \dots, N$. But then we have $\sum_{i=1}^N |x_i + t\eta_i| = \sum_{i=1}^N s_i (x_i + t\eta_i) = \sum_{i=1}^N |x_i|$ by (2.4), so that $x + t\eta$ is also a minimal solution, different from $x$. Hence, unique $\ell_1$-minimizers are necessarily $k$-sparse for some $k < N$. □
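A two-dimensional example, chosen here purely for illustration and not taken from the paper, shows the phenomenon of Remark 2.2 explicitly. Take $m = 1$, $N = 2$, $\Phi = (1 \;\; 1)$ and $y = 1$.

```latex
\begin{aligned}
\mathcal{F}(y) &= \{(t,\, 1 - t) : t \in \mathbb{R}\}, \qquad
\mathcal{N} = \operatorname{span}\{(1, -1)\}, \\
\|(t,\, 1 - t)\|_{\ell_1} &= |t| + |1 - t|
\;\begin{cases} = 1, & 0 \le t \le 1, \\ > 1, & \text{otherwise}. \end{cases}
\end{aligned}
```

Every point with $0 < t < 1$ is a minimizer with no vanishing entries; its sign pattern is $s = (1, 1)$, so that $s_1 \eta_1 + s_2 \eta_2 = 0$ for $\eta = (1, -1)$, in agreement with (2.4), and none of these minimizers is unique.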

We next consider minimization in a weighted $\ell_2(w)$-norm. We suppose that the weight $w$ is strictly positive, which we define to mean that $w_j > 0$ for all $j \in \{1, \dots, N\}$. In this case, $\ell_2(w)$ is a Hilbert space with the inner product

$$\langle u, v \rangle_w := \sum_{j=1}^N w_j u_j v_j. \tag{2.5}$$

We define

$$x^w := \operatorname*{argmin}_{z \in \mathcal{F}(y)} \|z\|_{\ell_2^N(w)}. \tag{2.6}$$

Because the $\|\cdot\|_{\ell_2^N(w)}$-norm is strictly convex, the minimizer $x^w$ is necessarily unique; it is completely characterized by the orthogonality conditions

$$\langle x^w, \eta \rangle_w = 0, \qquad \forall \eta \in \mathcal{N}. \tag{2.7}$$

Namely, $x^w$ necessarily satisfies (2.7); on the other hand, any element $z \in \mathcal{F}(y)$ that satisfies $\langle z, \eta \rangle_w = 0$ for all $\eta \in \mathcal{N}$ is automatically equal to $x^w$.
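The orthogonality conditions (2.7) can be checked by hand in the simplest case of a single equation ($m = 1$), where the weighted minimizer has the closed form $x_j = (\phi_j / w_j)\, y \,/ \sum_k \phi_k^2 / w_k$. The sketch below is restricted to that case; the function names and the numerical values are assumptions made for this illustration.

```python
def weighted_min_norm_single_equation(phi, y, w):
    """Minimizer of sum_j w_j z_j^2 subject to the single equation sum_j phi_j z_j = y.

    Closed form for m = 1: x_j = (phi_j / w_j) * y / sum_k (phi_k^2 / w_k).
    """
    denom = sum(p * p / wj for p, wj in zip(phi, w))
    return [(p / wj) * y / denom for p, wj in zip(phi, w)]


def inner_w(u, v, w):
    """Weighted inner product <u, v>_w = sum_j w_j u_j v_j, as in (2.5)."""
    return sum(wj * uj * vj for wj, uj, vj in zip(w, u, v))


phi, y, w = [1.0, 1.0], 1.0, [1.0, 2.0]   # assumed toy data
x_w = weighted_min_norm_single_equation(phi, y, w)
eta = [1.0, -1.0]                          # spans the null space of phi
print(x_w, inner_w(x_w, eta, w))           # inner product vanishes up to rounding
```

The coordinate with the larger weight receives the smaller entry, which is the mechanism that IRLS exploits to push small coordinates toward zero.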

At this point, we would like to tabulate some of the notation we have used in this paper to denote various kinds of minimizers and other solutions alike (such as limits of algorithms).

  $z$           an (arbitrary) element of $\mathcal{F}(y)$
  $x$           any solution of $\min_{z \in \mathcal{F}(y)} \|z\|_{\ell_1}$
  $x^*$         unique solution of $\min_{z \in \mathcal{F}(y)} \|z\|_{\ell_1}$ (notation used only when the minimizer is unique)
  $x^w$         unique solution of $\min_{z \in \mathcal{F}(y)} \|z\|_{\ell_2(w)}$, $w_j > 0$ for all $j$
  $\bar{x}$     limit of Algorithm 1
  $x^\epsilon$  unique solution of $\min_{z \in \mathcal{F}(y)} f_\epsilon(z)$; see (5.4)

Table 1: Notation for solutions and minimizers.


3  The  Restricted  Isometry  and  the  Null  Space  Properties 

To analyze the convergence of our algorithm, we shall impose the Restricted Isometry Property (RIP) already mentioned in the introduction, or a slightly weaker version, the Null Space Property, which will be defined below. Recall that $\Phi$ satisfies RIP of order $L$ for $\delta \in (0, 1)$ (see (1.11)) iff

$$(1 - \delta)\|z\|_{\ell_2^N} \le \|\Phi z\|_{\ell_2^m} \le (1 + \delta)\|z\|_{\ell_2^N}, \qquad \text{for all } L\text{-sparse } z. \tag{3.1}$$

It is known that many families of matrices satisfy the RIP. While there are deterministic families that are known to satisfy RIP, the largest range of $L$ (asymptotically, as $N \to \infty$, with e.g. $m/N$ kept constant) is obtained (to date) by using random families. For example, random families in which the entries of the matrix $\Phi$ are independent realizations of a (fixed) Gaussian or Bernoulli random variable are known to have the RIP with high probability for each $L \le c_0(\delta)\, n/\log n$ (see [7, 4, 2, 35] for a discussion of these results).

We shall say that $\Phi$ has the Null Space Property (NSP) of order $L$ for $\gamma > 0$ if$^1$

$$\|\eta_T\|_{\ell_1} \le \gamma \|\eta_{T^c}\|_{\ell_1}, \tag{3.2}$$

for all sets $T$ of cardinality not exceeding $L$ and all $\eta \in \mathcal{N}$. Here and later, we denote by $\eta_S$ the vector obtained from $\eta$ by setting to zero all coordinates $\eta_i$ for $i \notin S \subset \{1, 2, \dots, N\}$; $T^c$ denotes the complement of the set $T$. It is shown in Lemma 4.1 of [18] that if $\Phi$ has the RIP of order $L := J + J'$ for a given $\delta \in (0, 1)$, where $J, J' \ge 1$ are integers, then $\Phi$ has the NSP of order $J$ for $\gamma := \frac{1 + \delta}{1 - \delta}\sqrt{\frac{J}{J'}}$. Note that if $J'$ is sufficiently large then $\gamma < 1$.
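For a one-dimensional null space the smallest admissible constant $\gamma$ in (3.2) can be computed directly, because the ratio $\|\eta_T\|_{\ell_1} / \|\eta_{T^c}\|_{\ell_1}$ is invariant under scaling of $\eta$ and is largest when $T$ collects the $L$ largest entries in absolute value. The sketch below covers only this special case; the function name `nsp_constant_1d` and the example vector are assumptions for illustration.

```python
def nsp_constant_1d(eta, L):
    """Smallest gamma in (3.2) when the null space is spanned by the single vector eta.

    The worst index set T with at most L elements collects the L largest |eta_i|.
    """
    mags = sorted((abs(e) for e in eta), reverse=True)
    head, tail = sum(mags[:L]), sum(mags[L:])
    return head / tail if tail > 0 else float("inf")


# Null space spanned by (1, 1, -1), e.g. for Phi = [[1, 0, 1], [0, 1, 1]] (assumed toy matrix).
print(nsp_constant_1d([1.0, 1.0, -1.0], 1))  # order 1: gamma below 1
print(nsp_constant_1d([1.0, 1.0, -1.0], 2))  # order 2: gamma above 1
```

For null spaces of higher dimension the supremum over all $\eta \in \mathcal{N}$ has to be taken as well, which is no longer a simple sorting problem.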

Another result in [18] (see also Lemma 4.3 below) states that in order to guarantee that a $k$-sparse vector $x^*$ is the unique $\ell_1$-minimizer in $\mathcal{F}(y)$, it is sufficient that $\Phi$ has the NSP of order $L \ge k$ and $\gamma < 1$. (In fact, the argument in [4], proving that for $\Phi$ with the RIP, $\ell_1$-minimization identifies sparse vectors in $\mathcal{F}(y)$, can be split into two steps: one that implicitly derives the NSP from the RIP, and the remainder of the proof, which uses only the NSP.)

Note that if the NSP holds for some order $L_0$ and constant $\gamma_0$ (not necessarily $< 1$), then, by choosing $a > 0$ sufficiently small, one can ensure that $\Phi$ has the NSP of order $L = aL_0$ with constant $\gamma < 1$ (see [18] for details). So the effect of requiring that $\gamma < 1$ is tantamount to reducing the range of $L$ slightly.

When proving results on the convergence of our algorithm later in this paper, we shall state them under the assumptions that $\Phi$ has the NSP for some $\gamma < 1$ and an appropriate value of $L$. Using the observations above, they can easily be rephrased in terms of RIP bounds for $\Phi$.

4  Preliminary  results 

We first make some comments about the decreasing rearrangement $r(z)$ and the $j$-term approximation errors for vectors in $\mathbb{R}^N$. Let us denote by $\Sigma_k$ the set of all $x \in \mathbb{R}^N$ such that $\#(\operatorname{supp}(x)) \le k$. For any $z \in \mathbb{R}^N$ and any $j = 1, 2, \dots, N$, we denote by

$$\sigma_j(z)_{\ell_1} := \inf_{w \in \Sigma_j} \|z - w\|_{\ell_1^N} \tag{4.1}$$

the $\ell_1$-error in approximating a general vector $z \in \mathbb{R}^N$ by a $j$-sparse vector. Note that these approximation errors can be written as a sum of entries of $r(z)$: $\sigma_j(z)_{\ell_1} = \sum_{\nu > j} r(z)_\nu$.

[Footnote 1: This definition of the Null Space Property is a slight variant of that given in [18] but is more convenient for the results in the present paper.]

We have the following lemma:

Lemma 4.1 The map $z \mapsto r(z)$ is Lipschitz continuous on $(\mathbb{R}^N, \|\cdot\|_{\ell_\infty})$: for any $z, z' \in \mathbb{R}^N$, we have

$$\|r(z) - r(z')\|_{\ell_\infty} \le \|z - z'\|_{\ell_\infty}. \tag{4.2}$$

Moreover, for any $j$, we have

$$|\sigma_j(z)_{\ell_1} - \sigma_j(z')_{\ell_1}| \le \|z - z'\|_{\ell_1}, \tag{4.3}$$

and for any $J > j$, we have

$$(J - j)\, r(z)_J \le \|z - z'\|_{\ell_1} + \sigma_j(z')_{\ell_1}. \tag{4.4}$$

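Proof: For any pair of points $z$ and $z'$, and any $j \in \{1, \dots, N\}$, let $\Lambda$ be a set of $j - 1$ indices corresponding to the $j - 1$ largest entries in $z'$. Then

$$r(z)_j \le \max_{i \in \Lambda^c} |z_i| \le \max_{i \in \Lambda^c} |z'_i| + \|z - z'\|_{\ell_\infty} = r(z')_j + \|z - z'\|_{\ell_\infty}. \tag{4.5}$$

We can also reverse the roles of $z$ and $z'$. Therefore, we obtain (4.2). To prove (4.3), we approximate $z$ by a $j$-term best approximation $u \in \Sigma_j$ of $z'$ in $\ell_1^N$. Then

$$\sigma_j(z)_{\ell_1} \le \|z - u\|_{\ell_1} \le \|z - z'\|_{\ell_1} + \sigma_j(z')_{\ell_1},$$

and the result follows from symmetry.

To prove (4.4), it suffices to note that $(J - j)\, r(z)_J \le \sigma_j(z)_{\ell_1}$, because $\sigma_j(z)_{\ell_1}$ contains the $J - j$ terms $r(z)_{j+1}, \dots, r(z)_J$, each at least $r(z)_J$, and then to apply (4.3). ■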
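The three inequalities of Lemma 4.1 are easy to test numerically. The sketch below evaluates both sides of (4.2), (4.3) and (4.4) for a pair of arbitrary test vectors; the vectors and the helper names `sigma` and `lemma_4_1_holds` are assumptions introduced for this illustration, and the list index `r[J - 1]` stands for $r(z)_J$.

```python
def rearrangement(z):
    """Non-increasing rearrangement of the absolute values of z."""
    return sorted((abs(v) for v in z), reverse=True)


def sigma(z, j):
    """l1 error of the best j-term approximation: sum of r(z)_nu over nu > j."""
    return sum(rearrangement(z)[j:])


def lemma_4_1_holds(z, zp, slack=1e-12):
    """Check (4.2), (4.3) and (4.4) for the pair (z, zp) and all admissible j < J."""
    N = len(z)
    dinf = max(abs(a - b) for a, b in zip(z, zp))
    d1 = sum(abs(a - b) for a, b in zip(z, zp))
    r, rp = rearrangement(z), rearrangement(zp)
    ok = max(abs(a - b) for a, b in zip(r, rp)) <= dinf + slack          # (4.2)
    for j in range(1, N + 1):
        ok = ok and abs(sigma(z, j) - sigma(zp, j)) <= d1 + slack        # (4.3)
        for J in range(j + 1, N + 1):
            ok = ok and (J - j) * r[J - 1] <= d1 + sigma(zp, j) + slack  # (4.4)
    return ok


z = [3.0, -1.0, 0.5, 0.0, 2.0]
zp = [2.5, -1.5, 0.0, 0.25, 2.0]   # arbitrary test vectors (assumed)
print(sigma(z, 2), sigma(zp, 2), lemma_4_1_holds(z, zp))
```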

Our next result is an approximate reverse triangle inequality for points in $\mathcal{F}(y)$. Its importance to us lies in its implication that whenever two points $z, z' \in \mathcal{F}(y)$ have close $\ell_1$-norms and one of them is close to a $k$-sparse vector, then they necessarily are close to each other. (Note that it also implies that the other vector must then also be close to that $k$-sparse vector.) This is a geometric property of the null space.

Lemma 4.2 Assume that (3.2) holds for some $L$ and $\gamma < 1$. Then, for any $z, z' \in \mathcal{F}(y)$, we have

$$\|z' - z\|_{\ell_1} \le \frac{1 + \gamma}{1 - \gamma}\Big(\|z'\|_{\ell_1} - \|z\|_{\ell_1} + 2\sigma_L(z)_{\ell_1}\Big). \tag{4.6}$$

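Proof: Let $T$ be a set of indices of the $L$ largest entries in $z$. Then

$$\begin{aligned}
\|(z' - z)_{T^c}\|_{\ell_1} &\le \|z'_{T^c}\|_{\ell_1} + \|z_{T^c}\|_{\ell_1} \\
&= \|z'\|_{\ell_1} - \|z'_T\|_{\ell_1} + \sigma_L(z)_{\ell_1} \\
&= \|z\|_{\ell_1} + \|z'\|_{\ell_1} - \|z\|_{\ell_1} - \|z'_T\|_{\ell_1} + \sigma_L(z)_{\ell_1} \\
&= \|z_T\|_{\ell_1} - \|z'_T\|_{\ell_1} + \|z'\|_{\ell_1} - \|z\|_{\ell_1} + 2\sigma_L(z)_{\ell_1} \\
&\le \|(z' - z)_T\|_{\ell_1} + \|z'\|_{\ell_1} - \|z\|_{\ell_1} + 2\sigma_L(z)_{\ell_1}.
\end{aligned} \tag{4.7}$$

Using (3.2), applied to $\eta = z' - z \in \mathcal{N}$, this gives

$$\|(z' - z)_T\|_{\ell_1} \le \gamma \|(z' - z)_{T^c}\|_{\ell_1} \le \gamma\Big(\|(z' - z)_T\|_{\ell_1} + \|z'\|_{\ell_1} - \|z\|_{\ell_1} + 2\sigma_L(z)_{\ell_1}\Big). \tag{4.8}$$

In other words

$$\|(z' - z)_T\|_{\ell_1} \le \frac{\gamma}{1 - \gamma}\Big(\|z'\|_{\ell_1} - \|z\|_{\ell_1} + 2\sigma_L(z)_{\ell_1}\Big). \tag{4.9}$$

Using this, together with (4.7), we obtain

$$\|z' - z\|_{\ell_1} = \|(z' - z)_{T^c}\|_{\ell_1} + \|(z' - z)_T\|_{\ell_1} \le \frac{1 + \gamma}{1 - \gamma}\Big(\|z'\|_{\ell_1} - \|z\|_{\ell_1} + 2\sigma_L(z)_{\ell_1}\Big), \tag{4.10}$$

as desired. ■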


This result then allows the following simple proof of some of the results of [18]:

Lemma 4.3 Assume that (3.2) holds for some $L$ and $\gamma < 1$. Suppose that $\mathcal{F}(y)$ contains an $L$-sparse vector. Then this vector is the unique $\ell_1$-minimizer in $\mathcal{F}(y)$; denoting it by $x^*$, we have moreover, for all $v \in \mathcal{F}(y)$,

$$\|v - x^*\|_{\ell_1} \le 2\,\frac{1 + \gamma}{1 - \gamma}\,\sigma_L(v)_{\ell_1}. \tag{4.11}$$

Proof: For the time being, we denote the $L$-sparse vector in $\mathcal{F}(y)$ by $x_s$.

Applying (4.6) with $z' = v$ and $z = x_s$, we find

$$\|v - x_s\|_{\ell_1} \le \frac{1 + \gamma}{1 - \gamma}\Big[\|v\|_{\ell_1} - \|x_s\|_{\ell_1}\Big];$$

since $v \in \mathcal{F}(y)$ is arbitrary, this implies that $\|v\|_{\ell_1} - \|x_s\|_{\ell_1} \ge 0$ for all $v \in \mathcal{F}(y)$, so that $x_s$ is an $\ell_1$-norm minimizer in $\mathcal{F}(y)$.

If $x'$ were another $\ell_1$-minimizer in $\mathcal{F}(y)$, then it would follow that $\|x'\|_{\ell_1} = \|x_s\|_{\ell_1}$, and the inequality we just derived would imply $\|x' - x_s\|_{\ell_1} = 0$, or $x' = x_s$. It follows that $x_s$ is the unique $\ell_1$-minimizer in $\mathcal{F}(y)$, which we denote by $x^*$, as proposed earlier.

Finally, we apply (4.6) with $z' = x^*$ and $z = v$, and we obtain

$$\|v - x^*\|_{\ell_1} \le \frac{1 + \gamma}{1 - \gamma}\Big(\|x^*\|_{\ell_1} - \|v\|_{\ell_1} + 2\sigma_L(v)_{\ell_1}\Big) \le 2\,\frac{1 + \gamma}{1 - \gamma}\,\sigma_L(v)_{\ell_1},$$

where we have used the $\ell_1$-minimization property of $x^*$. ■
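The bound (4.11) can be observed on a small example. The sketch below assumes the toy system $\Phi = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{pmatrix}$, $y = (1, 1)$, whose solution set is $\{(t, t, 1 - t)\}$, whose null space $\operatorname{span}\{(1, 1, -1)\}$ satisfies (3.2) with $L = 1$ and $\gamma = 1/2$, and whose 1-sparse solution is $x^* = (0, 0, 1)$. All names and values are assumptions introduced for this illustration.

```python
def sigma(z, j):
    """Best j-term l1 approximation error: sum of all but the j largest |z_i|."""
    return sum(sorted((abs(v) for v in z), reverse=True)[j:])


def bound_4_11(t, gamma=0.5, L=1):
    """Return (lhs, rhs) of (4.11) for v = (t, t, 1 - t) and x* = (0, 0, 1)."""
    v = [t, t, 1.0 - t]
    x_star = [0.0, 0.0, 1.0]
    lhs = sum(abs(a - b) for a, b in zip(v, x_star))
    rhs = 2 * (1 + gamma) / (1 - gamma) * sigma(v, L)
    return lhs, rhs


for t in [-2.0, -0.5, 0.1, 1 / 3, 0.5, 0.9, 3.0]:
    lhs, rhs = bound_4_11(t)
    print(f"t={t:7.3f}  lhs={lhs:7.3f}  rhs={rhs:7.3f}  holds={lhs <= rhs}")
```

In this example the left-hand side equals $3|t|$, while the right-hand side is at least $6|t|$, so the bound holds with room to spare; the lemma guarantees it for every feasible $v$.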


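Our next set of remarks centers around the functional $J$ defined by (1.5). Note that for each $n = 1, 2, \dots$, we have

$$J(x^n, w^n, \epsilon_n) = \sum_{j=1}^N \big[(x_j^n)^2 + \epsilon_n^2\big]^{1/2}. \tag{4.12}$$

We also have the following monotonicity property which holds for all $n \ge 0$:

$$J(x^{n+1}, w^{n+1}, \epsilon_{n+1}) \le J(x^{n+1}, w^n, \epsilon_{n+1}) \le J(x^{n+1}, w^n, \epsilon_n) \le J(x^n, w^n, \epsilon_n). \tag{4.13}$$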

Here the first inequality follows from the minimization property that defines $w^{n+1}$, the second inequality from $\epsilon_{n+1} \le \epsilon_n$, and the last inequality from the minimization property that defines $x^{n+1}$. For each $n$, $x^{n+1}$ is completely determined by $w^n$; for $n = 0$, in particular, $x^1$ is determined solely by $w^0$, and independent of the choice of $x^0 \in \mathcal{F}(y)$. (With the initial weight vector defined by $w^0 = (1, \ldots, 1)$, $x^1$ is the classical minimum $\ell_2$-norm element of $\mathcal{F}(y)$.) The inequality (4.13) for $n = 0$ thus holds for arbitrary $x^0 \in \mathcal{F}(y)$.

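Lemma 4.4 For each $n \ge 1$ we have

$$\|x^n\|_{\ell_1} \le J(x^1, w^0, \epsilon_0) =: A \qquad (4.14)$$

and

$$w_j^n \ge A^{-1}, \quad j = 1, \ldots, N. \qquad (4.15)$$

Proof: The bound (4.14) follows from (4.13) and

$$\|x^n\|_{\ell_1} \le \sum_{j=1}^{N} \big[ (x_j^n)^2 + \epsilon_n^2 \big]^{1/2} = J(x^n, w^n, \epsilon_n).$$

The bound (4.15) follows from $(w_j^n)^{-1} = [(x_j^n)^2 + \epsilon_n^2]^{1/2} \le J(x^n, w^n, \epsilon_n) \le A$, where the last inequality uses (4.13). ■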
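The monotonicity (4.13) and the bounds of Lemma 4.4 can be observed numerically. The following is a minimal pure-Python sketch, not the authors' implementation: it assumes the standard closed form $x = D\Phi^T(\Phi D \Phi^T)^{-1}y$ with $D = \mathrm{diag}(1/w_j)$ for the weighted least squares step, the update $\epsilon_{n+1} = \min(\epsilon_n, r(x^{n+1})_{K+1}/N)$ from (1.7), and the starting values $w^0 = (1,\ldots,1)$, $\epsilon_0 = 1$. The matrix `Phi`, the data `y`, and all function names are hypothetical choices made for illustration.

```python
def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    m = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for c in range(m):
        p = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, m):
            f = M[r][c] / M[c][c]
            for k in range(c, m + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * m
    for r in range(m - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, m))
        x[r] = (M[r][m] - s) / M[r][r]
    return x


def weighted_min_norm(Phi, y, w):
    """Minimize sum_j w_j z_j^2 subject to Phi z = y (assumed closed form, D = diag(1/w))."""
    m, N = len(Phi), len(Phi[0])
    d = [1.0 / wj for wj in w]
    G = [[sum(Phi[a][j] * d[j] * Phi[b][j] for j in range(N)) for b in range(m)]
         for a in range(m)]
    lam = solve(G, y)
    return [d[j] * sum(Phi[a][j] * lam[a] for a in range(m)) for j in range(N)]


def J(z, w, eps):
    """The functional J(z, w, eps) = (1/2) [sum z_j^2 w_j + sum (eps^2 w_j + 1/w_j)]."""
    return 0.5 * (sum(zj * zj * wj for zj, wj in zip(z, w))
                  + sum(eps * eps * wj + 1.0 / wj for wj in w))


def irls(Phi, y, K, steps):
    """Return [(x^n, w^n, eps_n) for n = 1, 2, ...] and A = J(x^1, w^0, eps_0)."""
    N = len(Phi[0])
    w, eps = [1.0] * N, 1.0
    A, history = None, []
    for _ in range(steps):
        x = weighted_min_norm(Phi, y, w)
        if A is None:
            A = J(x, w, eps)
        r = sorted((abs(v) for v in x), reverse=True)
        eps = min(eps, r[K] / N)      # r[K] is r(x)_{K+1} in 1-based notation
        if eps == 0.0:                # the algorithm stops when eps_n = 0
            break
        w = [(xj * xj + eps * eps) ** -0.5 for xj in x]
        history.append((x, w, eps))
    return history, A


# Hypothetical 2 x 4 example; y is not a multiple of any column of Phi.
Phi = [[1.0, 2.0, 0.0, 1.0], [0.0, 1.0, 3.0, -1.0]]
y = [1.0, 2.0]
history, A = irls(Phi, y, K=1, steps=20)
values = [J(x, w, eps) for x, w, eps in history]
```

In this sketch `values` plays the role of the sequence $J(x^n, w^n, \epsilon_n)$, which should be non-increasing, while `A` bounds $\|x^n\|_{\ell_1}$ from above and $1/A$ bounds the weights from below.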


5  Convergence  of  the  algorithm 

In this section, we prove that the algorithm converges. Our starting point is the following lemma that establishes $(x^n - x^{n+1}) \to 0$ for $n \to \infty$.


Lemma 5.1 Given any $y \in \mathbb{R}^m$, the $x^n$ satisfy

$$\sum_{n=1}^{\infty} \|x^{n+1} - x^n\|_{\ell_2}^2 \le 2A^2, \qquad (5.1)$$

where $A$ is the constant of Lemma 4.4. In particular, we have

$$\lim_{n\to\infty} (x^n - x^{n+1}) = 0. \qquad (5.2)$$

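Proof: For each $n = 1, 2, \ldots$, we have

$$\begin{aligned}
2\big[J(x^n, w^n, \epsilon_n) - J(x^{n+1}, w^{n+1}, \epsilon_{n+1})\big] &\ge 2\big[J(x^n, w^n, \epsilon_n) - J(x^{n+1}, w^n, \epsilon_n)\big] \\
&= \langle x^n, x^n \rangle_{w^n} - \langle x^{n+1}, x^{n+1} \rangle_{w^n} \\
&= \langle x^n + x^{n+1}, x^n - x^{n+1} \rangle_{w^n} \\
&= \langle x^n - x^{n+1}, x^n - x^{n+1} \rangle_{w^n} \\
&= \sum_{j=1}^{N} w_j^n (x_j^n - x_j^{n+1})^2 \\
&\ge A^{-1} \|x^n - x^{n+1}\|_{\ell_2}^2,
\end{aligned} \qquad (5.3)$$

where the third equality uses the fact that $\langle x^{n+1}, x^n - x^{n+1} \rangle_{w^n} = 0$ (observe that $x^{n+1} - x^n \in \mathcal{N}$ and invoke (2.7)), and the inequality uses the bound (4.15) on the weights. If we now sum these inequalities over $n \ge 1$, we arrive at (5.1). ■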


From the monotonicity of $\epsilon_n$, we know that $\epsilon := \lim_{n\to\infty} \epsilon_n$ exists and is non-negative. The following functional will play an important role in our proof of convergence:

$$f_\epsilon(z) := \sum_{j=1}^{N} (z_j^2 + \epsilon^2)^{1/2}. \qquad (5.4)$$

Notice that if we knew that $x^n$ converged to $x$ then, in view of (4.12), $f_\epsilon(x)$ would be the limit of $J(x^n, w^n, \epsilon_n)$. When $\epsilon > 0$ the functional $f_\epsilon$ is strictly convex and therefore has a unique minimizer

$$x^\epsilon := \mathop{\mathrm{argmin}}_{z \in \mathcal{F}(y)} f_\epsilon(z). \qquad (5.5)$$

This minimizer is characterized by the following lemma:

Lemma 5.2 Let $\epsilon > 0$ and $z \in \mathcal{F}(y)$. Then $z = x^\epsilon$ if and only if $\langle z, \eta \rangle_{w(z,\epsilon)} = 0$ for all $\eta \in \mathcal{N}$, where $w(z,\epsilon)_i = [z_i^2 + \epsilon^2]^{-1/2}$.

Proof: For the “only if” part, let $z = x^\epsilon$ and $\eta \in \mathcal{N}$ be arbitrary. Consider the analytic function

$$G_\epsilon(t) := f_\epsilon(z + t\eta) - f_\epsilon(z).$$

We have $G_\epsilon(0) = 0$, and by the minimization property $G_\epsilon(t) \ge 0$ for all $t \in \mathbb{R}$. Hence, $G'_\epsilon(0) = 0$. A simple calculation reveals that

$$G'_\epsilon(0) = \sum_{i=1}^{N} \frac{\eta_i z_i}{[z_i^2 + \epsilon^2]^{1/2}} = \langle z, \eta \rangle_{w(z,\epsilon)},$$

which gives the desired result.

For the “if” part, assume that $z \in \mathcal{F}(y)$ and $\langle z, \eta \rangle_{w(z,\epsilon)} = 0$ for all $\eta \in \mathcal{N}$, where $w(z,\epsilon)$ is defined as above. We shall show that $z$ is a minimizer of $f_\epsilon$ on $\mathcal{F}(y)$. Indeed, consider the convex univariate function $[u^2 + \epsilon^2]^{1/2}$. For any point $u_0$ we have from convexity that

$$[u^2 + \epsilon^2]^{1/2} \ge [u_0^2 + \epsilon^2]^{1/2} + [u_0^2 + \epsilon^2]^{-1/2} u_0 (u - u_0), \qquad (5.6)$$

because the right side is the linear function which is tangent to this function at $u_0$. It follows that for any point $v \in \mathcal{F}(y)$ we have

$$f_\epsilon(v) \ge f_\epsilon(z) + \sum_{j=1}^{N} [z_j^2 + \epsilon^2]^{-1/2} z_j (v_j - z_j) = f_\epsilon(z) + \langle z, v - z \rangle_{w(z,\epsilon)} = f_\epsilon(z), \qquad (5.7)$$

where we have used the orthogonality condition (5.13) and the fact that $v - z$ is in $\mathcal{N}$. Since $v$ is arbitrary, it follows that $z = x^\epsilon$, as claimed. ■
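To illustrate the characterization in Lemma 5.2, here is a small sketch on a hypothetical toy problem with $\Phi = (1, 2)$ and $y = 1$, so that $\mathcal{F}(y) = \{x_0 + t\eta\}$ with $x_0 = (1, 0)$ and $\eta = (2, -1)$ spanning the null space. The weighted inner product $\langle z, \eta \rangle_{w(z,\epsilon)}$ is the derivative of $t \mapsto f_\epsilon(x_0 + t\eta)$, so its root (found here by bisection, a choice made for this sketch) should be the minimizer $x^\epsilon$. All names and numerical values are assumptions for illustration.

```python
import math


def f_eps(z, eps):
    """The functional f_eps(z) = sum_j (z_j^2 + eps^2)^(1/2) from (5.4)."""
    return sum(math.sqrt(zj * zj + eps * eps) for zj in z)


def weighted_inner(z, eta, eps):
    """<z, eta>_{w(z,eps)} with w(z,eps)_i = (z_i^2 + eps^2)^(-1/2)."""
    return sum(zi * ei / math.sqrt(zi * zi + eps * eps) for zi, ei in zip(z, eta))


# Hypothetical toy data: Phi = [1, 2], y = 1, null space spanned by eta.
x0, eta, eps = [1.0, 0.0], [2.0, -1.0], 0.1


def point(t):
    """The element x0 + t * eta of F(y)."""
    return [a + t * b for a, b in zip(x0, eta)]


# The derivative is negative at t = -1 and positive at t = 0, and it is
# increasing in t because f_eps is convex, so bisection finds its root.
lo, hi = -1.0, 0.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if weighted_inner(point(mid), eta, eps) < 0.0:
        lo = mid
    else:
        hi = mid
t_star = 0.5 * (lo + hi)
z_star = point(t_star)
```

At `z_star` the orthogonality condition holds up to rounding, and nearby points of $\mathcal{F}(y)$ give larger values of $f_\epsilon$, consistent with the “if” part of the lemma.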


We  now  give  the  convergence  of  the  algorithm. 


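Theorem 5.3 Let $K$ (the same index as used in the update rule (1.7)) be chosen so that $\Phi$ satisfies the Null Space Property (3.2) of order $K$, with $\gamma < 1$. Then, for each $y \in \mathbb{R}^m$, the output of Algorithm 1 converges to a vector $\bar{x}$, with $r(\bar{x})_{K+1} = N \lim_{n\to\infty} \epsilon_n$, and the following hold:

(i) If $\epsilon = \lim_{n\to\infty} \epsilon_n = 0$, then $\bar{x}$ is $K$-sparse; in this case there is therefore a unique $\ell_1$-minimizer $x^*$, and $\bar{x} = x^*$; moreover, we have, for $k \le K$, and any $z \in \mathcal{F}(y)$,

$$\|z - \bar{x}\|_{\ell_1} \le c\, \sigma_k(z)_{\ell_1}, \quad \text{with } c := \frac{2(1+\gamma)}{1-\gamma}; \qquad (5.8)$$

(ii) If $\epsilon = \lim_{n\to\infty} \epsilon_n > 0$, then $\bar{x} = x^\epsilon$;

(iii) In this last case, if $\gamma$ satisfies the stricter bound $\gamma < 1 - \frac{2}{K+2}$ (or, equivalently, if $\frac{2\gamma}{1-\gamma} < K$), then we have, for all $z \in \mathcal{F}(y)$ and any $k < K - \frac{2\gamma}{1-\gamma}$, that

$$\|z - \bar{x}\|_{\ell_1} \le \tilde{c}\, \sigma_k(z)_{\ell_1}, \quad \text{with } \tilde{c} := \frac{2(1+\gamma)}{1-\gamma} \left[ \frac{K - k + \frac{3}{2}}{K - k - \frac{2\gamma}{1-\gamma}} \right]. \qquad (5.9)$$

As a consequence, this case is excluded if $\mathcal{F}(y)$ contains a vector of sparsity $k < K - \frac{2\gamma}{1-\gamma}$.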


The constant $\tilde{c}$ can be quite reasonable; for instance, if $\gamma \le 1/2$ and $k \le K - 3$, then we have $\tilde{c} \le 9\,\frac{1+\gamma}{1-\gamma} \le 27$.
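As a quick sanity check on the constants in (5.8) and (5.9), the following sketch evaluates $c$ and $\tilde{c}$ for given $\gamma$, $K$, $k$. The function name and its return convention (`None` when $k$ lies outside the admissible range $k < K - \frac{2\gamma}{1-\gamma}$) are choices made here for illustration only.

```python
def theorem_5_3_constants(gamma, K, k):
    """Return (c, c_tilde) from (5.8) and (5.9).

    c_tilde is None when k >= K - 2*gamma/(1 - gamma), where (5.9) does not apply.
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    gap = 2.0 * gamma / (1.0 - gamma)
    c = 2.0 * (1.0 + gamma) / (1.0 - gamma)
    if k < K - gap:
        c_tilde = c * (K - k + 1.5) / (K - k - gap)
    else:
        c_tilde = None
    return c, c_tilde


# The extreme case gamma = 1/2, k = K - 3 of the remark on the size of the constant.
c, c_tilde = theorem_5_3_constants(0.5, 10, 7)
```

For $\gamma = 1/2$ and $K - k = 3$ this gives $c = 6$ and $\tilde{c} = 27$, the largest value allowed by the bound quoted in the text.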


Proof: Note that since $\epsilon_{n+1} \le \epsilon_n$, the $\epsilon_n$ always converge. We start by considering the case $\epsilon := \lim_{n\to\infty} \epsilon_n = 0$.

Case $\epsilon = 0$: In this case, we want to prove that $x^n$ converges, and that its limit is an $\ell_1$-minimizer. Suppose that $\epsilon_{n_0} = 0$ for some $n_0$. Then by the definition of the algorithm, we know that the iteration is stopped at $n = n_0$, and $x^n = x^{n_0}$, $n \ge n_0$. Therefore $\bar{x} = x^{n_0}$. From the definition of $\epsilon_n$, it then also follows that $r(x^{n_0})_{K+1} = 0$ and so $\bar{x} = x^{n_0}$ is $K$-sparse. As noted in §3 and Lemma 4.3, if a $K$-sparse solution exists when $\Phi$ satisfies the NSP of order $K$ with $\gamma < 1$, then it is the unique $\ell_1$-minimizer. Therefore, $\bar{x}$ equals $x^*$, this unique minimizer.

Suppose now that $\epsilon_n > 0$ for all $n$. Since $\epsilon_n \to 0$, there is an increasing sequence of indices $(n_i)$ such that $\epsilon_{n_i} < \epsilon_{n_i - 1}$ for all $i$. By the definition (1.7) of $(\epsilon_n)_{n \in \mathbb{N}}$, we must have $r(x^{n_i})_{K+1} < N \epsilon_{n_i - 1}$ for all $i$. Noting that $(x^n)_{n \in \mathbb{N}}$ is a bounded sequence, there exists a subsequence $(p_j)_{j \in \mathbb{N}}$ of $(n_i)_{i \in \mathbb{N}}$ such that $(x^{p_j})_{j \in \mathbb{N}}$ converges to a point $\tilde{x} \in \mathcal{F}(y)$. By Lemma 4.1, we know that $r(x^{p_j})_{K+1}$ converges to $r(\tilde{x})_{K+1}$. Hence we get

$$r(\tilde{x})_{K+1} = \lim_{j\to\infty} r(x^{p_j})_{K+1} \le \lim_{j\to\infty} N \epsilon_{p_j - 1} = 0, \qquad (5.10)$$

which means that the support-width of $\tilde{x}$ is at most $K$, i.e. $\tilde{x}$ is $K$-sparse. By the same token used above, we again have that $\tilde{x} = x^*$, the unique $\ell_1$-minimizer. We must still show that $x^n \to x^*$. Since $x^{p_j} \to x^*$ and $\epsilon_{p_j} \to 0$, (4.12) implies $J(x^{p_j}, w^{p_j}, \epsilon_{p_j}) \to \|x^*\|_{\ell_1}$. By the monotonicity property stated in (4.13), we get $J(x^n, w^n, \epsilon_n) \to \|x^*\|_{\ell_1}$. Since (4.12) implies

$$J(x^n, w^n, \epsilon_n) - N\epsilon_n \le \|x^n\|_{\ell_1} \le J(x^n, w^n, \epsilon_n), \qquad (5.11)$$

we obtain $\|x^n\|_{\ell_1} \to \|x^*\|_{\ell_1}$. Finally, we invoke Lemma 4.2 with $z' = x^n$, $z = x^*$, and $k = K$ to get

$$\limsup_{n\to\infty} \|x^n - x^*\|_{\ell_1} \le \frac{1+\gamma}{1-\gamma} \Big( \lim_{n\to\infty} \|x^n\|_{\ell_1} - \|x^*\|_{\ell_1} \Big) = 0, \qquad (5.12)$$

which completes the proof that $x^n \to x^*$ in this case.

Finally, (5.8) follows from (4.11) of Lemma 4.3 (with $L = K$), and the observation that $\sigma_n(z) \ge \sigma_{n'}(z)$ if $n \le n'$.

Case $\epsilon > 0$: We shall first show that $x^n \to x^\epsilon$, $n \to \infty$, with $x^\epsilon$ as defined by (5.5). By Lemma 4.4, we know that $(x^n)_{n=1}^{\infty}$ is a bounded sequence in $\mathbb{R}^N$ and hence this sequence has accumulation points. Let $(x^{n_i})$ be any convergent subsequence of $(x^n)$ and let $\tilde{x} \in \mathcal{F}(y)$ be its limit. We want to show that $\tilde{x} = x^\epsilon$.

Since $w_j^n = [(x_j^n)^2 + \epsilon_n^2]^{-1/2} \le \epsilon^{-1}$, it follows that $\lim_{i\to\infty} w_j^{n_i} = [(\tilde{x}_j)^2 + \epsilon^2]^{-1/2} = w(\tilde{x}, \epsilon)_j =: \tilde{w}_j$, $j = 1, \ldots, N$. On the other hand, by invoking Lemma 5.1, we now find that $x^{n_i + 1} \to \tilde{x}$, $i \to \infty$. It then follows from the orthogonality relations (2.7) that for every $\eta \in \mathcal{N}$, we have

$$\langle \tilde{x}, \eta \rangle_{\tilde{w}} = \lim_{i\to\infty} \langle x^{n_i+1}, \eta \rangle_{w^{n_i}} = 0. \qquad (5.13)$$

Now the “if” part of Lemma 5.2 implies that $\tilde{x} = x^\epsilon$. Hence $x^\epsilon$ is the unique accumulation point of $(x^n)_{n \in \mathbb{N}}$ and therefore its limit. This establishes (ii).

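To prove the error estimate (5.9) stated in (iii), we first note that for any $z \in \mathcal{F}(y)$, we have

$$\|x^\epsilon\|_{\ell_1} \le f_\epsilon(x^\epsilon) \le f_\epsilon(z) \le \|z\|_{\ell_1} + N\epsilon, \qquad (5.14)$$

where the second inequality uses the minimizing property of $x^\epsilon$. Hence it follows that $\|x^\epsilon\|_{\ell_1} - \|z\|_{\ell_1} \le N\epsilon$. We now invoke Lemma 4.2 to obtain

$$\|x^\epsilon - z\|_{\ell_1} \le \frac{1+\gamma}{1-\gamma} \big[ N\epsilon + 2\sigma_k(z)_{\ell_1} \big]. \qquad (5.15)$$

From Lemma 4.1 and (1.7), we obtain

$$N\epsilon = \lim_{n\to\infty} N\epsilon_n \le \lim_{n\to\infty} r(x^n)_{K+1} = r(x^\epsilon)_{K+1}. \qquad (5.16)$$

It follows from (4.4) that

$$\begin{aligned}
(K + 1 - k) N\epsilon &\le (K + 1 - k)\, r(x^\epsilon)_{K+1} \\
&\le \|x^\epsilon - z\|_{\ell_1} + \sigma_k(z)_{\ell_1} \\
&\le \frac{1+\gamma}{1-\gamma} \big[ N\epsilon + 2\sigma_k(z)_{\ell_1} \big] + \sigma_k(z)_{\ell_1},
\end{aligned} \qquad (5.17)$$

where the last inequality uses (5.15). Since by assumption on $K$, we have $K - k > \frac{2\gamma}{1-\gamma}$, i.e. $K + 1 - k > \frac{1+\gamma}{1-\gamma}$, we obtain

$$N\epsilon + 2\sigma_k(z)_{\ell_1} \le \frac{2(K - k) + 3}{(K - k) - \frac{2\gamma}{1-\gamma}}\, \sigma_k(z)_{\ell_1}.$$

Using this back in (5.15), we arrive at (5.9).

Finally, notice that if $\mathcal{F}(y)$ contains a $k$-sparse vector (with $k < K - \frac{2\gamma}{1-\gamma}$), then we know already (see §3) that this must be the unique $\ell_1$-minimizer $x^*$; it then follows from our arguments above that we must have $\epsilon = 0$. Indeed, if we had $\epsilon > 0$, then (5.17) would hold for $z = x^*$; since $x^*$ is $k$-sparse, $\sigma_k(x^*)_{\ell_1} = 0$, implying $\epsilon = 0$, a contradiction with the assumption $\epsilon > 0$. This finishes the proof. ■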


Remark 5.4 Let us briefly compare our analysis of the IRLS algorithm with $\ell_1$-minimization. The latter recovers a $k$-sparse solution (when one exists) if $\Phi$ has the NSP of order $K$ and $k \le K$. The analysis given in our proof of Theorem 5.3 guarantees that our IRLS algorithm recovers $k$-sparse $x$ for a slightly smaller range of values $k$ than $\ell_1$-minimization, namely for $k < K - \frac{2\gamma}{1-\gamma}$. Notice that this “gap” vanishes for vanishingly small $\gamma$. Although we have no examples to demonstrate, our arguments cannot exclude the case where $\mathcal{F}(y)$ contains a $k$-sparse vector $x^*$ with $K - \frac{2\gamma}{1-\gamma} \le k \le K$ (e.g., if $\gamma \ge 1/3$ and $k = K - 1$), and our IRLS algorithm converges to $\bar{x}$, yet $\bar{x} \ne x^*$. However, note that unless $\gamma$ is close to 1, the range of $k$-values in this “gap” is fairly small; for instance, for $\gamma < \frac{1}{3}$, this non-recovery of a $k$-sparse $x^*$ can happen only if $k = K$. □

Remark 5.5 The constant $c$ in (5.8) is clearly smaller than the constant $\tilde{c}$ in (5.9); it follows that when $k < K - \frac{2\gamma}{1-\gamma}$, the estimate (5.9) holds for all cases, regardless of whether $\epsilon = 0$ or not. □

6  Rate  of  Convergence 

Under the conditions of Theorem 5.3 the algorithm converges to a limit $\bar{x}$; if there is a $k$-sparse vector in $\mathcal{F}(y)$ with $k < K - \frac{2\gamma}{1-\gamma}$, then this limit coincides with that $k$-sparse vector, which is then also automatically the unique $\ell_1$-minimizer $x^*$. In this section our goal is to establish a bound for the rate of convergence in both the sparse and non-sparse cases. In the latter case, the goal is to establish the rate at which $x^n$ approaches a ball of radius $C_1 \sigma_k(x^*)_{\ell_1}$ centered at $x^*$. We shall work under the same assumptions as in Theorem 5.3.

6.1 Case of k-sparse vectors

Let us begin by assuming that $\mathcal{F}(y)$ contains the $k$-sparse vector $x^*$. The algorithm produces the sequence $x^n$, which converges to $x^*$, as established above. Let us denote the (unknown) support of the $k$-sparse vector $x^*$ by $T$.

We introduce an auxiliary sequence of error vectors $\eta^n \in \mathcal{N}$ via $\eta^n := x^n - x^*$ and

$$E_n := \|\eta^n\|_{\ell_1^N} = \|x^n - x^*\|_{\ell_1^N}.$$

We know that $E_n \to 0$. The following theorem gives a bound on the rate of convergence of $E_n$ to zero.


Theorem 6.1 Assume $\Phi$ satisfies the NSP of order $K$ with constant $\gamma$ such that $0 < \gamma < 1 - \frac{2}{K+2}$. Suppose that $k < K - \frac{2\gamma}{1-\gamma}$ and $0 < \rho < 1$ are such that

$$\mu := \frac{\gamma(1+\gamma)}{1-\rho} \left( 1 + \frac{1}{K+1-k} \right) < 1.$$

Assume that $\mathcal{F}(y)$ contains a $k$-sparse vector $x^*$ and let $T = \mathrm{supp}(x^*)$. Let $n_0$ be such that

$$E_{n_0} \le R^* := \rho \min_{i \in T} |x_i^*|. \qquad (6.1)$$

Then for all $n \ge n_0$, we have

$$E_{n+1} \le \mu E_n.$$

Consequently $x^n$ converges to $x^*$ exponentially.

Remark 6.2 Notice that if $\gamma$ is sufficiently small, e.g. $\gamma(1+\gamma) < \frac{2}{3}$, then for any $k < K$, there is a $\rho > 0$ for which $\mu < 1$, so we have exponential convergence to $x^*$ whenever $x^*$ is $k$-sparse. □
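To get a feeling for the size of the contraction factor $\mu$ in Theorem 6.1, the sketch below evaluates it and converts it into an iteration count through the consequence $E_{n_0+n} \le \mu^n E_{n_0}$. The function names and the sample values $\gamma = 0.2$, $\rho = 0.3$, $K = 5$, $k = 3$ are illustrative assumptions, not values from the text.

```python
import math


def rate_mu(gamma, rho, K, k):
    """mu = gamma (1 + gamma) / (1 - rho) * (1 + 1 / (K + 1 - k)) as in Theorem 6.1."""
    return gamma * (1.0 + gamma) / (1.0 - rho) * (1.0 + 1.0 / (K + 1 - k))


def sparsity_in_range(gamma, K, k):
    """Check the sparsity condition k < K - 2 gamma / (1 - gamma)."""
    return k < K - 2.0 * gamma / (1.0 - gamma)


def steps_to_reduce(mu, factor):
    """Smallest n with mu**n <= factor, assuming 0 < mu < 1 and 0 < factor < 1."""
    return math.ceil(math.log(factor) / math.log(mu))


mu = rate_mu(0.2, 0.3, 5, 3)
n_steps = steps_to_reduce(mu, 1e-6)
```

With these sample values $\mu = 0.32/0.7 \approx 0.457$, and `n_steps` is the number of further iterations after which the bound guarantees a reduction of the error by a factor $10^{-6}$.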

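Proof: We start with the relation (2.7) with $w = w^n$, $x^w = x^{n+1} = x^* + \eta^{n+1}$, and $\eta = x^{n+1} - x^* = \eta^{n+1}$, which gives

$$\sum_{i=1}^{N} (x_i^* + \eta_i^{n+1})\, \eta_i^{n+1} w_i^n = 0.$$

Rearranging the terms and using the fact that $x^*$ is supported on $T$, we get

$$\sum_{i=1}^{N} |\eta_i^{n+1}|^2 w_i^n = -\sum_{i \in T} x_i^* \eta_i^{n+1} w_i^n = -\sum_{i \in T} \frac{x_i^*}{[(x_i^n)^2 + \epsilon_n^2]^{1/2}}\, \eta_i^{n+1}. \qquad (6.2)$$

We will prove the theorem by induction. Let us assume that we have shown $E_n \le R^*$ already. We then have, for all $i \in T$,

$$|\eta_i^n| \le \|\eta^n\|_{\ell_1^N} = E_n \le \rho |x_i^*|,$$

so that

$$\frac{|x_i^*|}{[(x_i^n)^2 + \epsilon_n^2]^{1/2}} \le \frac{|x_i^*|}{|x_i^n|} = \frac{|x_i^*|}{|x_i^* + \eta_i^n|} \le \frac{1}{1-\rho}, \qquad (6.3)$$

and hence (6.2) combined with (6.3) and NSP gives

$$\sum_{i=1}^{N} |\eta_i^{n+1}|^2 w_i^n \le \frac{1}{1-\rho} \|\eta_T^{n+1}\|_{\ell_1} \le \frac{\gamma}{1-\rho} \|\eta_{T^c}^{n+1}\|_{\ell_1}.$$

At the same time, the Cauchy-Schwarz inequality combined with the above estimate yields

$$\begin{aligned}
\|\eta_{T^c}^{n+1}\|_{\ell_1}^2 &\le \Big( \sum_{i \in T^c} |\eta_i^{n+1}|^2 w_i^n \Big) \Big( \sum_{i \in T^c} [(x_i^n)^2 + \epsilon_n^2]^{1/2} \Big) \\
&\le \Big( \sum_{i=1}^{N} |\eta_i^{n+1}|^2 w_i^n \Big) \Big( \sum_{i \in T^c} [(\eta_i^n)^2 + \epsilon_n^2]^{1/2} \Big) \\
&\le \frac{\gamma}{1-\rho} \|\eta_{T^c}^{n+1}\|_{\ell_1} \big( \|\eta^n\|_{\ell_1} + N\epsilon_n \big).
\end{aligned} \qquad (6.4)$$

If $\eta_{T^c}^{n+1} = 0$, then $x_{T^c}^{n+1} = 0$. In this case $x^{n+1}$ is $k$-sparse and the algorithm has stopped by definition; since $x^{n+1} - x^*$ is in the null space $\mathcal{N}$, which contains no $k$-sparse elements other than 0, we have already obtained the solution $x^{n+1} = x^*$. If $\eta_{T^c}^{n+1} \ne 0$, then after canceling the factor $\|\eta_{T^c}^{n+1}\|_{\ell_1}$ in (6.4), we get

$$\|\eta_{T^c}^{n+1}\|_{\ell_1} \le \frac{\gamma}{1-\rho} \big( \|\eta^n\|_{\ell_1} + N\epsilon_n \big),$$

and thus

$$\|\eta^{n+1}\|_{\ell_1} = \|\eta_T^{n+1}\|_{\ell_1} + \|\eta_{T^c}^{n+1}\|_{\ell_1} \le (1+\gamma) \|\eta_{T^c}^{n+1}\|_{\ell_1} \le \frac{\gamma(1+\gamma)}{1-\rho} \big( \|\eta^n\|_{\ell_1} + N\epsilon_n \big). \qquad (6.5)$$

Now, we also have by (1.7) and (4.4)

$$N\epsilon_n \le r(x^n)_{K+1} \le \frac{1}{K+1-k} \big( \|x^n - x^*\|_{\ell_1} + \sigma_k(x^*)_{\ell_1} \big) = \frac{\|\eta^n\|_{\ell_1}}{K+1-k}, \qquad (6.6)$$

since by assumption $\sigma_k(x^*) = 0$. This, together with (6.5), yields the desired bound,

$$E_{n+1} = \|\eta^{n+1}\|_{\ell_1} \le \frac{\gamma(1+\gamma)}{1-\rho} \left( 1 + \frac{1}{K+1-k} \right) \|\eta^n\|_{\ell_1} = \mu E_n.$$

In particular, since $\mu < 1$, we have $E_{n+1} \le R^*$, which completes the induction step. It follows that $E_{n+1} \le \mu E_n$ for all $n \ge n_0$. ■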


Remark 6.3 Note that the precise update rule (1.7) for $\epsilon_n$ does not really intervene in this analysis. If $E_{n_0} \le R^*$, then the estimate

$$E_{n+1} \le \mu_0 (E_n + N\epsilon_n) \quad \text{with } \mu_0 := \gamma(1+\gamma)/(1-\rho), \qquad (6.7)$$

guarantees that all further $E_n$ will be bounded by $R^*$ as well, provided $N\epsilon_n \le (\mu_0^{-1} - 1) R^*$. It is only in guaranteeing that (6.1) must be satisfied for some $n_0$ that the update rule plays a role: indeed, by Theorem 5.3, $E_n \to 0$ for $n \to \infty$ if $\epsilon_n$ is updated following (1.7), so that (6.1) has to be satisfied eventually.

Other update rules may work as well. If $(\epsilon_n)_{n \in \mathbb{N}}$ is defined so that it is a monotonically decreasing sequence with limit $\epsilon$, then the relation (6.7) immediately implies that

$$\limsup_{n\to\infty} E_n \le \frac{\mu_0 N \epsilon}{1 - \mu_0}.$$

In particular, if $\epsilon = 0$, then $E_n \to 0$. The rate at which $E_n \to 0$ in this case will depend on $\mu_0$ as well as on the rate with which $\epsilon_n \to 0$. We shall not quantify this relation, except to note that if $\epsilon_n = O(\beta^n)$ for some $\beta < 1$, then $E_n = O(n \tilde{\mu}^n)$ where $\tilde{\mu} = \max(\mu_0, \beta)$. □
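The two statements at the end of Remark 6.3 can be illustrated by iterating the recursion (6.7) with equality, which is the worst case it allows. The sketch below is only an illustration of the recursion, not of the IRLS iteration itself; the function name and the sample values of $\mu_0$, $N$, $\beta$ are assumptions.

```python
def iterate_error_bound(mu0, N, eps_seq, E0):
    """Iterate E_{n+1} = mu0 * (E_n + N * eps_n), the extreme case of (6.7)."""
    E, out = E0, [E0]
    for eps in eps_seq:
        E = mu0 * (E + N * eps)
        out.append(E)
    return out


mu0, N = 0.5, 10

# Constant eps_n = eps: the iterates approach mu0 * N * eps / (1 - mu0).
eps = 0.01
constant_case = iterate_error_bound(mu0, N, [eps] * 60, 1.0)
limit = mu0 * N * eps / (1.0 - mu0)

# Geometric eps_n = beta**n with beta = mu0: here E_n = mu0**n * (E_0 + N * n),
# which exhibits the extra factor n in the O(n * mu**n) rate.
beta = 0.5
geometric_case = iterate_error_bound(mu0, N, [beta ** n for n in range(20)], 1.0)
```

The second case shows why the factor $n$ cannot be dropped in general when $\beta = \mu_0$.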

6.2 Case of noisy k-sparse vectors

We show here that the exponential rate of convergence to a $k$-sparse limit vector can be extended to the case where the “ideal” (i.e. $k$-sparse) target vector has been corrupted by noise and is therefore only “approximately $k$-sparse”. More precisely, we no longer assume that $\mathcal{F}(y)$ contains a $k$-sparse vector; consequently the limit $\bar{x}$ of the $x^n$ need not be an $\ell_1$-minimizer (see Theorem 5.3). If $\tilde{x}$ is any $\ell_1$-minimizer in $\mathcal{F}(y)$, Theorem 5.3 guarantees $\|\bar{x} - \tilde{x}\|_{\ell_1} \le C \sigma_k(\tilde{x})_{\ell_1}$; since this is the best level of accuracy guaranteed in the limit, we are in this case interested only in how fast $x^n$ will converge to a ball centered at $\tilde{x}$ with radius given by some (prearranged) multiple of $\sigma_k(\tilde{x})_{\ell_1}$. (Note that if $\mathcal{F}(y)$ contains several $\ell_1$-minimizers, they all lie within a distance $C' \sigma_k(\tilde{x})_{\ell_1}$ of each other, so that it does not matter which $\tilde{x}$ we pick.) We shall express the notion that $z$ is “approximately $k$-sparse with gap ratio $C$”, or a “noisy version of a $k$-sparse vector, with gap ratio $C$” by the condition

$$r(z)_k \ge C \sigma_k(z)_{\ell_1},$$

where $k$ is such that $\Phi$ has the NSP for some pair $K$, $\gamma$ such that $0 \le k < K - \frac{2\gamma}{1-\gamma}$ (e.g. we could have $K = k + 1$ if $\gamma < 1/2$). If the gap ratio $C$ is much greater than the constant $C_1$ in (5.9), then exponential convergence can be exhibited for a meaningful number of iterations. Note that this class includes perturbations of any $k$-sparse vector for which the perturbation is sufficiently small in $\ell_1$-norm (when compared to the unperturbed $k$-sparse vector).

Our argument for the noisy case will closely resemble the case for the exact $k$-sparse vectors. However there are some crucial differences that justify our decision to separate these two cases.

We will be interested in only the case $\epsilon > 0$, where we recall that $\epsilon$ is the limit of the $\epsilon_n$ occurring in the algorithm. This assumption implies $\sigma_k(\bar{x})_{\ell_1} > 0$, and can only happen if $\bar{x}$ is not $K$-sparse. (As noted earlier, the exact $k$-sparse case always corresponds to $\epsilon = 0$ if $k < K - \frac{2\gamma}{1-\gamma}$. For $k$ in the region $K - \frac{2\gamma}{1-\gamma} \le k \le K$, both $\epsilon = 0$ and $\epsilon > 0$ are theoretical possibilities.)

First, we redefine $\eta^n = x^n - x^\epsilon$, where $x^\epsilon$ is the minimizer of $f_\epsilon$ on $\mathcal{F}(y)$ and $\epsilon > 0$. We know from Theorem 5.3 that $\eta^n \to 0$. We again set $E_n = \|\eta^n\|_{\ell_1}$.

Theorem 6.4 Given $0 < \rho < 1$, and integers $k$, $K$ with $k < K$, assume that $\Phi$ satisfies the NSP of order $K$ with constant $\gamma$ such that all the conditions of Theorem 5.3 are satisfied and, in addition,

$$\mu := \frac{\gamma(1+\gamma)}{1-\rho} \left( 1 + \frac{1}{K+1-k} \right) < 1.$$

Suppose $z \in \mathcal{F}(y)$ is “approximately $k$-sparse with gap ratio $C$”, i.e.

$$r(z)_k \ge C \sigma_k(z)_{\ell_1} \qquad (6.8)$$

with $C > C_1$, where $C_1$ is as in Theorem 5.3. Let $T$ stand for the set of indices of the $k$ largest entries of $x^\epsilon$, and $n_0$ be such that

$$E_{n_0} \le R^* := \rho \min_{i \in T} |x_i^\epsilon| = \rho\, r(x^\epsilon)_k. \qquad (6.9)$$

Then for all $n \ge n_0$, we have

$$E_{n+1} \le \mu E_n + B \sigma_k(z)_{\ell_1}, \qquad (6.10)$$

where $B > 0$ is a constant. Similarly, if we define $\tilde{E}_n = \|x^n - z\|_{\ell_1}$, then

$$\tilde{E}_{n+1} \le \mu \tilde{E}_n + \tilde{B} \sigma_k(z)_{\ell_1}, \qquad (6.11)$$

for $n \ge n_0$, where $\tilde{B} > 0$ is a constant. This implies that $x^n$ converges at an exponential (linear) rate to the ball of radius $\tilde{B}(1-\mu)^{-1} \sigma_k(z)_{\ell_1}$ centered at $z$.


Remark 6.5 Note that Theorem 5.3 trivially implies the inequalities (6.10) and (6.11) in the limit $n \to \infty$ since $E_n \to 0$, $\sigma_k(z)_{\ell_1} > 0$, and $\|\bar{x} - z\|_{\ell_1} \le C_1 \sigma_k(z)_{\ell_1}$. However, Theorem 6.4 quantifies the event when it is guaranteed that the two measures of error, $E_n$ and $\tilde{E}_n$, must shrink (at least) by a factor $\tilde{\mu} < 1$ at each iteration. As noted above, this corresponds to the range $\sigma_k(z)_{\ell_1} \lesssim E_n, \tilde{E}_n \lesssim r(x^\epsilon)_k$, and would be realized if, say, $z$ is the sum of a $k$-sparse vector and a fully supported “noise” vector which is sufficiently small in $\ell_1$ norm. In this sense, the theorem shows that the rate estimate of Theorem 5.3 extends to a neighborhood of $k$-sparse vectors.


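Proof: First, note that the existence of $n_0$ is guaranteed by the fact that $E_n \to 0$ and $R^* > 0$. For the latter, note that Lemma 4.1 and Theorem 5.3 imply

$$r(x^\epsilon)_k \ge r(z)_k - \|z - x^\epsilon\|_{\ell_1} \ge (C - C_1) \sigma_k(z)_{\ell_1},$$

so that $R^* \ge \rho (C - C_1) \sigma_k(z)_{\ell_1} > 0$.

We follow the proof of Theorem 6.1 and consider the orthogonality relation (6.2). Since $x^\epsilon$ is not sparse in general, we rewrite (6.2) as

$$\sum_{i=1}^{N} |\eta_i^{n+1}|^2 w_i^n = -\sum_{i=1}^{N} x_i^\epsilon \eta_i^{n+1} w_i^n = -\sum_{i \in T \cup T^c} \frac{x_i^\epsilon}{[(x_i^n)^2 + \epsilon_n^2]^{1/2}}\, \eta_i^{n+1}. \qquad (6.12)$$

We deal with the contribution on $T$ in the same way as before:

$$\Big| \sum_{i \in T} \frac{x_i^\epsilon}{[(x_i^n)^2 + \epsilon_n^2]^{1/2}}\, \eta_i^{n+1} \Big| \le \frac{1}{1-\rho} \|\eta_T^{n+1}\|_{\ell_1} \le \frac{\gamma}{1-\rho} \|\eta_{T^c}^{n+1}\|_{\ell_1}.$$

For the contribution on $T^c$, note that

$$\beta_n := \max_{i \in T^c} \frac{|\eta_i^{n+1}|}{[(x_i^n)^2 + \epsilon_n^2]^{1/2}} \le \epsilon^{-1} \|\eta^{n+1}\|_{\ell_\infty}.$$

Since $\eta^n \to 0$ we have $\beta_n \to 0$. It follows that

$$\Big| \sum_{i \in T^c} \frac{x_i^\epsilon}{[(x_i^n)^2 + \epsilon_n^2]^{1/2}}\, \eta_i^{n+1} \Big| \le \beta_n \sigma_k(x^\epsilon)_{\ell_1} \le \beta_n \big( \sigma_k(z)_{\ell_1} + \|x^\epsilon - z\|_{\ell_1} \big) \le C_2 \beta_n \sigma_k(z)_{\ell_1}, \qquad (6.13)$$

where the second inequality is due to Lemma 4.1, the last one to Theorem 5.3, and $C_2 = C_1 + 1$. Combining these two bounds, we get

$$\sum_{i=1}^{N} |\eta_i^{n+1}|^2 w_i^n \le \frac{\gamma}{1-\rho} \|\eta_{T^c}^{n+1}\|_{\ell_1} + C_2 \beta_n \sigma_k(z)_{\ell_1}.$$

We combine this again with a Cauchy-Schwarz estimate, to obtain

$$\begin{aligned}
\|\eta_{T^c}^{n+1}\|_{\ell_1}^2 &\le \Big( \sum_{i \in T^c} |\eta_i^{n+1}|^2 w_i^n \Big) \Big( \sum_{i \in T^c} [(x_i^n)^2 + \epsilon_n^2]^{1/2} \Big) \\
&\le \Big( \sum_{i=1}^{N} |\eta_i^{n+1}|^2 w_i^n \Big) \Big( \sum_{i \in T^c} \big( |\eta_i^n| + |x_i^\epsilon| + \epsilon_n \big) \Big) \\
&\le \Big( \frac{\gamma}{1-\rho} \|\eta_{T^c}^{n+1}\|_{\ell_1} + C_2 \beta_n \sigma_k(z)_{\ell_1} \Big) \big( \|\eta_{T^c}^n\|_{\ell_1} + \sigma_k(x^\epsilon)_{\ell_1} + N\epsilon_n \big) \\
&\le \Big( \frac{\gamma}{1-\rho} \|\eta_{T^c}^{n+1}\|_{\ell_1} + C_2 \beta_n \sigma_k(z)_{\ell_1} \Big) \big( \|\eta^n\|_{\ell_1} + C_2 \sigma_k(z)_{\ell_1} + N\epsilon_n \big).
\end{aligned} \qquad (6.14)$$

It is easy to check that if $u^2 \le Au + B$, where $A$ and $B$ are positive, then $u \le A + B/A$. Applying this to $u = \|\eta_{T^c}^{n+1}\|_{\ell_1}$ in the above estimate, we get

$$\|\eta_{T^c}^{n+1}\|_{\ell_1} \le \frac{\gamma}{1-\rho} \big[ \|\eta^n\|_{\ell_1} + C_2 \sigma_k(z)_{\ell_1} + N\epsilon_n \big] + C_3 \beta_n \sigma_k(z)_{\ell_1}, \qquad (6.15)$$

where $C_3 = C_2 (1-\rho)/\gamma$. Similar to (6.6), we also have, by combining (4.4) with (part of) the chain of inequalities (6.13),

$$N\epsilon_n \le r(x^n)_{K+1} \le \frac{1}{K+1-k} \big( \|x^n - x^\epsilon\|_{\ell_1} + \sigma_k(x^\epsilon)_{\ell_1} \big) \le \frac{1}{K+1-k} \big( \|\eta^n\|_{\ell_1} + C_2 \sigma_k(z)_{\ell_1} \big), \qquad (6.16)$$

and consequently (6.15) becomes

$$\|\eta^{n+1}\|_{\ell_1} \le (1+\gamma) \|\eta_{T^c}^{n+1}\|_{\ell_1} \le \frac{\gamma(1+\gamma)}{1-\rho} \left( 1 + \frac{1}{K+1-k} \right) \|\eta^n\|_{\ell_1} + (1+\gamma)(C_3 \beta_n + C_4) \sigma_k(z)_{\ell_1}, \qquad (6.17)$$

where $C_4 = C_2 \gamma (1-\rho)^{-1} (1 + 1/(K+1-k))$. Since the $\beta_n$ are bounded, this gives

$$E_{n+1} \le \mu E_n + B \sigma_k(z)_{\ell_1}.$$

It then follows that if we pick $\tilde{\mu}$ so that $1 > \tilde{\mu} > \mu$, and consider the range of $n > n_0$ such that $E_n \ge (\tilde{\mu} - \mu)^{-1} B \sigma_k(z)_{\ell_1} =: r^*$, then

$$E_{n+1} \le \tilde{\mu} E_n.$$

Hence we are guaranteed exponential decay of $E_n$ as long as $x^n$ is sufficiently far from its limit. The smallest possible value of $r^*$ corresponds to the case $\tilde{\mu} \approx 1$.

To establish a rate of convergence to a comparably-sized ball centered at $z$, we consider $\tilde{E}_n = \|x^n - z\|_{\ell_1}$. It then follows that

$$\begin{aligned}
\tilde{E}_{n+1} &\le \|x^{n+1} - x^\epsilon\|_{\ell_1} + \|x^\epsilon - z\|_{\ell_1} \\
&\le \mu \|x^n - x^\epsilon\|_{\ell_1} + B \sigma_k(z)_{\ell_1} + C_1 \sigma_k(z)_{\ell_1} \\
&\le \mu \|x^n - z\|_{\ell_1} + B \sigma_k(z)_{\ell_1} + C_1 (1+\mu) \sigma_k(z)_{\ell_1} \\
&= \mu \tilde{E}_n + \tilde{B} \sigma_k(z)_{\ell_1},
\end{aligned} \qquad (6.18)$$

which shows the claimed exponential decay and also that

$$\limsup_{n\to\infty} \tilde{E}_n \le \tilde{B} (1-\mu)^{-1} \sigma_k(z)_{\ell_1}.$$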


7 Beyond the convex case: $\ell_\tau$-minimization for $\tau < 1$

If $\Phi$ has the NSP of order $K$ with $\gamma < 1$, then (see §3) $\ell_1$-minimization recovers $K$-sparse solutions to $\Phi x = y$ for any $y \in \mathbb{R}^m$ that admits such a sparse solution, i.e., $\ell_1$-minimization gives also $\ell_0$-minimizers, provided their support has size at most $K$. In [29], Gribonval and Nielsen showed that in this case, $\ell_1$-minimization also gives the $\ell_\tau$-minimizers, i.e., $\ell_1$-minimization also solves non-convex optimization problems of the type

$$x^* = \mathop{\mathrm{argmin}}_{z \in \mathcal{F}(y)} \|z\|_{\ell_\tau^N}^\tau, \quad \text{for } 0 < \tau < 1. \qquad (7.1)$$

Let us first recall the results of [29] that are of most interest to us here, reformulated for our setting and notations.

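Lemma 7.1 ([29, Theorem 2]). Assume that $x^*$ is a $K$-sparse vector in $\mathcal{F}(y)$ and that $0 < \tau \le 1$. If

$$\sum_{i \in T} |\eta_i|^\tau < \sum_{i \in T^c} |\eta_i|^\tau, \quad \text{or, equivalently,} \quad \sum_{i \in T} |\eta_i|^\tau < \frac{1}{2} \sum_{i=1}^{N} |\eta_i|^\tau,$$

for all $\eta \in \mathcal{N}$ and for all $T \subset \{1, \ldots, N\}$ with $\#T \le K$, then

$$x^* = \mathop{\mathrm{argmin}}_{z \in \mathcal{F}(y)} \|z\|_{\ell_\tau^N}^\tau.$$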


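Lemma 7.2 ([29, Theorem 5]). Let $z \in \mathbb{R}^N$, $0 < \tau_1 \le \tau_2 \le 1$, and $K \in \mathbb{N}$. Then

$$\sup_{T \subset \{1,\ldots,N\},\ \#T \le K} \frac{\sum_{i \in T} |z_i|^{\tau_1}}{\sum_{i=1}^{N} |z_i|^{\tau_1}} \le \sup_{T \subset \{1,\ldots,N\},\ \#T \le K} \frac{\sum_{i \in T} |z_i|^{\tau_2}}{\sum_{i=1}^{N} |z_i|^{\tau_2}}.$$

Combining these two lemmas with the observations in §3 leads immediately to the following result.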


Theorem 7.3 Fix any $0 < \tau \le 1$. If $\Phi$ satisfies the NSP of order $K$ with constant $\gamma$ then

$$\sum_{i \in T} |\eta_i|^\tau \le \gamma \sum_{i \in T^c} |\eta_i|^\tau, \qquad (7.2)$$

for all $\eta \in \mathcal{N}$ and for all $T \subset \{1, \ldots, N\}$ such that $\#T \le K$.

In addition, if $\gamma < 1$, and if there exists a $K$-sparse vector in $\mathcal{F}(y)$, then this $K$-sparse vector is the unique minimizer in $\mathcal{F}(y)$ of $\|\cdot\|_{\ell_\tau}$.
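The monotonicity in $\tau$ stated in Lemma 7.2, which is what transfers the NSP from $\tau = 1$ to smaller $\tau$ in Theorem 7.3, is easy to observe numerically. In the sketch below the supremum over index sets $T$ with $\#T \le K$ is attained by the $K$ largest entries in absolute value; the function name and the sample vector are illustrative assumptions.

```python
def top_k_ratio(z, K, tau):
    """sup over #T <= K of sum_{i in T} |z_i|^tau / sum_i |z_i|^tau (z must be nonzero)."""
    powers = sorted((abs(v) ** tau for v in z), reverse=True)
    return sum(powers[:K]) / sum(powers)


z = [3.0, -1.0, 0.5, 0.2]
taus = [0.25, 0.5, 0.75, 1.0]
ratios_K1 = [top_k_ratio(z, 1, t) for t in taus]
ratios_K2 = [top_k_ratio(z, 2, t) for t in taus]
```

Both lists should be non-decreasing in $\tau$: the smaller the exponent, the less of the total mass is concentrated on the $K$ largest entries.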

At first sight, these results suggest there is nothing to be gained by carrying out $\ell_\tau$- rather than $\ell_1$-minimization; in addition sparse recovery via the non-convex problems (7.1) is much harder than the more easily solvable convex relaxation problem of $\ell_1$-minimization.

Yet, we shall show in this section that $\ell_\tau$-minimization has unexpected benefits, and that it may be both useful and practically feasible via an IRLS approach. Before we start, it is expedient to introduce the following definition: we shall say that $\Phi$ has the $\tau$-Null Space Property ($\tau$-NSP) of order $K$ with constant $\gamma > 0$ if, for all sets $T$ of cardinality at most $K$ and all $\eta \in \mathcal{N}$,

$$\|\eta_T\|_{\ell_\tau^N}^\tau \le \gamma \|\eta_{T^c}\|_{\ell_\tau^N}^\tau. \qquad (7.3)$$

In what follows we shall construct an IRLS algorithm for $\ell_\tau$-minimization. We shall see that

(a) In practice, $\ell_\tau$-minimization can be carried out by an IRLS algorithm. Hence, the non-convexity does not necessarily make the problem intractable;

(b) In particular, if $\Phi$ satisfies the $\tau$-NSP of order $K$, and if there exists a $k$-sparse vector $x^*$ in $\mathcal{F}(y)$, with $k < K - \kappa$ for suitable $\kappa$ given below, then the IRLS algorithm converges to the $\ell_\tau$-minimizer $x^\tau$, which, therefore, will coincide with $x^*$;

(c) Surprisingly the rate of local convergence of the algorithm is superlinear; the rate is larger for smaller $\tau$, increasing to approach a quadratic regime as $\tau \to 0$. More precisely, we will show that the local error $E_n := \|x^n - x^*\|_{\ell_\tau^N}^\tau$ satisfies

$$E_{n+1} \le \mu(\gamma, \tau)\, E_n^{2-\tau}, \qquad (7.4)$$

where $\mu(\gamma, \tau) < 1$ for $\gamma > 0$ sufficiently small. The validity of (7.4) is restricted to $x^n$ in a (small) ball centered at $x^*$. In particular, if $x^0$ is close enough to $x^*$ then (7.4) ensures the convergence of the algorithm to the $k$-sparse solution $x^*$.
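The effect of the exponent $2 - \tau$ in (7.4) can be seen by iterating the bound with equality. The sketch below only iterates this scalar recursion, not the algorithm itself, and compares a superlinear case with the linear case $\tau = 1$; the function name and the values $E_0 = 0.5$, $\mu = 0.9$ are assumptions chosen for illustration.

```python
def steps_until(E0, mu, tau, tol, max_steps=10000):
    """Number of steps of E <- mu * E**(2 - tau) needed to reach E <= tol."""
    E, n = E0, 0
    while E > tol and n < max_steps:
        E = mu * E ** (2.0 - tau)
        n += 1
    return n


linear_steps = steps_until(0.5, 0.9, 1.0, 1e-12)   # exponent 1: linear rate
half_steps = steps_until(0.5, 0.9, 0.5, 1e-12)     # exponent 1.5
small_tau_steps = steps_until(0.5, 0.9, 0.1, 1e-12)  # exponent 1.9, near quadratic
```

With these sample values the linear recursion needs a few hundred steps to reach the tolerance, while the superlinear ones need only a handful, and fewer still as $\tau$ decreases.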

Some of these virtues of $\ell_\tau$-minimization were recently highlighted by Chartrand and his collaborators [11, 12, 13]. Chartrand and Staneva [13] give a fine analysis of the RIP from which they can conclude that $\ell_\tau$-minimization not only recovers $k$-sparse vectors, but that the range of $k$ for which this recovery works is larger for smaller $\tau$. Namely, for random Gaussian matrices, they prove that with high probability on the draw of the matrix sparse recovery by $\ell_\tau$-minimization works for $k < m[c_1(\tau) + \tau c_2(\tau) \log(N/k)]^{-1}$, where $c_1(\tau)$ is bounded and $c_2(\tau)$ decreases to zero as $\tau \to 0$. In particular, the dependence of the sparsity $k$ on the number $N$ of columns vanishes for $\tau \to 0$. These bounds give a quantitative estimate of the improvement provided by $\ell_\tau$-minimization vis a vis $\ell_1$-minimization, for which the range of $k$-sparsity for having exact recovery is clearly smaller (see Figure 8.4 for a numerical illustration).

7.1 Some useful properties of $\ell_\tau$ spaces

We start by listing in one proposition some fundamental and well-known properties of $\ell_\tau$ spaces for $0 < \tau \le 1$. For further details we refer the reader to, e.g., [19].

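Proposition 7.4

(i) Assume $0 < \tau \le 1$. Then the map $z \mapsto \|z\|_{\ell_\tau^N}$ defines a quasi-norm for $\mathbb{R}^N$; in particular the triangle inequality holds up to a constant, i.e.,

$$\|u + v\|_{\ell_\tau^N} \le C(\tau) \big( \|u\|_{\ell_\tau^N} + \|v\|_{\ell_\tau^N} \big), \quad \text{for all } u, v \in \mathbb{R}^N. \qquad (7.5)$$

If one considers the $\tau$-th powers of the “$\tau$-norm”, then one has the so-called “$\tau$-triangle inequality”:

$$\|u + v\|_{\ell_\tau^N}^\tau \le \|u\|_{\ell_\tau^N}^\tau + \|v\|_{\ell_\tau^N}^\tau, \quad \text{for all } u, v \in \mathbb{R}^N. \qquad (7.6)$$

(ii) We have, for any $0 < \tau_1 \le \tau_2 \le \infty$,

$$\|u\|_{\ell_{\tau_2}^N} \le \|u\|_{\ell_{\tau_1}^N}, \quad \text{for all } u \in \mathbb{R}^N. \qquad (7.7)$$

We will refer to this norm estimate by writing the embedding relation $\ell_{\tau_1}^N \hookrightarrow \ell_{\tau_2}^N$.

(iii) (Generalized Hölder inequality) For $0 < \tau \le 1$ and $0 < p, q < \infty$ such that $\frac{1}{\tau} = \frac{1}{p} + \frac{1}{q}$, and for a positive weight vector $w = (w_i)_{i=1}^N$, we have

$$\|(u_i v_i)_{i=1}^N\|_{\ell_\tau^N(w)} \le \|u\|_{\ell_p^N(w)} \|v\|_{\ell_q^N(w)}, \quad \text{for all } u, v \in \mathbb{R}^N, \qquad (7.8)$$

where $\|v\|_{\ell_r^N(w)} := \big( \sum_{i=1}^{N} |v_i|^r w_i \big)^{1/r}$, as usual, for $0 < r < \infty$.

For technical reasons, it is often more convenient to employ the $\tau$-triangle inequality (7.6) than (7.5); in this sense, for $\ell_\tau$-minimization $\|\cdot\|_{\ell_\tau^N}^\tau$ turns out to be more natural as a measure of error than the quasi-norm $\|\cdot\|_{\ell_\tau^N}$.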


In order to prove the three claims (a)-(c) listed before the start of this subsection, we also
need to generalize to $\ell_\tau$ certain results previously shown only for $\ell_1$. In the following we assume
$0 < \tau \le 1$. We denote by

$$\sigma_k(z)_{\ell_\tau^N} := \sum_{\nu > k} r(z)_\nu^\tau$$

the error of the best $k$-term approximation to $z$ with respect to $\|\cdot\|^\tau_{\ell_\tau^N}$. As a straightforward
generalization of analogous results valid for the $\ell_1$-norm, we have the following two technical lemmas.

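Lemma 7.5 For any $j \in \{1, \dots, N\}$, we have

$$|\sigma_j(z)_{\ell_\tau^N} - \sigma_j(z')_{\ell_\tau^N}| \le \|z - z'\|^\tau_{\ell_\tau^N},$$

for all $z, z' \in \mathbb{R}^N$. Moreover, for any $J > j$, we have

$$(J - j)\, r(z)_J^\tau \le \sigma_j(z)_{\ell_\tau^N} \le \|z - z'\|^\tau_{\ell_\tau^N} + \sigma_j(z')_{\ell_\tau^N}.$$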

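Lemma 7.6 Assume that $\Phi$ has the $\tau$-NSP of order $K$ with constant $0 < \gamma < 1$. Then, for any
$z, z' \in \mathcal{F}(y)$, we have

$$\|z' - z\|^\tau_{\ell_\tau^N} \le \frac{1 + \gamma}{1 - \gamma} \left( \|z'\|^\tau_{\ell_\tau^N} - \|z\|^\tau_{\ell_\tau^N} + 2 \sigma_K(z)_{\ell_\tau^N} \right).$$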


The proofs of these lemmas are essentially identical to the ones of Lemma 4.1 and Lemma 4.2,
except for substituting $\|\cdot\|^\tau_{\ell_\tau^N}$ for $\|\cdot\|_{\ell_1^N}$ and $\sigma_k(\cdot)_{\ell_\tau^N}$ for $\sigma_k(\cdot)_{\ell_1^N}$, respectively.
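
The quantities in Lemma 7.5 are easy to compute directly. The following Python sketch (function names are hypothetical, and the check is a numerical illustration rather than a proof) evaluates the nonincreasing rearrangement $r(z)$, the best $k$-term approximation error $\sigma_k(z)_{\ell_\tau^N}$, and tests both inequalities of the lemma for given vectors.

```python
def rearrangement(z):
    """Nonincreasing rearrangement r(z) of the absolute values of the entries of z."""
    return sorted((abs(x) for x in z), reverse=True)


def sigma(z, k, tau):
    """Best k-term approximation error: sum over nu > k of r(z)_nu^tau."""
    return sum(r ** tau for r in rearrangement(z)[k:])


def dist_tau(z, zp, tau):
    """tau-th power of the l_tau quasi-norm of z - z'."""
    return sum(abs(a - b) ** tau for a, b in zip(z, zp))


def lemma_7_5_holds(z, zp, j, J, tau, tol=1e-12):
    """Check both inequalities of Lemma 7.5 for 1 <= j < J <= len(z)."""
    d = dist_tau(z, zp, tau)
    first = abs(sigma(z, j, tau) - sigma(zp, j, tau)) <= d + tol
    r_J = rearrangement(z)[J - 1]  # r(z)_J, with J counted from 1
    lower = (J - j) * r_J ** tau <= sigma(z, j, tau) + tol
    upper = sigma(z, j, tau) <= d + sigma(zp, j, tau) + tol
    return first and lower and upper
```

For example, for $z = (3, -1, 0.5, 0)$, $k = 1$ and $\tau = 1$, the best one-term approximation keeps the entry $3$ and the error is $1 + 0.5 + 0 = 1.5$.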


7.2  An IRLS algorithm for $\ell_\tau$-minimization

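To define an IRLS algorithm promoting $\ell_\tau$-minimization for a generic $0 < \tau \le 1$, we first define a
$\tau$-dependent functional $J_\tau$, generalizing $J$:

$$J_\tau(z, w, \epsilon) := \frac{\tau}{2} \left[ \sum_{j=1}^N z_j^2 w_j + \sum_{j=1}^N \left( \epsilon^2 w_j + \frac{2 - \tau}{\tau} \frac{1}{w_j^{\tau/(2-\tau)}} \right) \right], \quad z \in \mathbb{R}^N,\ w \in \mathbb{R}^N_+,\ \epsilon \in \mathbb{R}_+. \tag{7.9}$$

The desired algorithm is then defined simply by substituting $J_\tau$ for $J$ in Algorithm 1, keeping the
same update rule (1.7) for $\epsilon$. In particular we have

$$w_j^{n+1} = \left( (x_j^{n+1})^2 + \epsilon_{n+1}^2 \right)^{-\frac{2-\tau}{2}}, \quad j = 1, \dots, N,$$

and

$$J_\tau(x^{n+1}, w^{n+1}, \epsilon_{n+1}) = \sum_{j=1}^N \left( (x_j^{n+1})^2 + \epsilon_{n+1}^2 \right)^{\tau/2}.$$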
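
The relation between the functional (7.9), the weight update and the resulting value of $J_\tau$ can be checked numerically. The following Python sketch uses hypothetical function names and is only an illustration: it evaluates $J_\tau$, computes the updated weights, and evaluates the sum $\sum_j (x_j^2 + \epsilon^2)^{\tau/2}$ to which $J_\tau$ reduces at those weights.

```python
def j_tau(z, w, eps, tau):
    """Functional (7.9): (tau/2) [ sum z_j^2 w_j + sum (eps^2 w_j + ((2-tau)/tau) w_j^(-tau/(2-tau))) ]."""
    quad = sum(zj * zj * wj for zj, wj in zip(z, w))
    reg = sum(eps * eps * wj + (2.0 - tau) / tau * wj ** (-tau / (2.0 - tau)) for wj in w)
    return 0.5 * tau * (quad + reg)


def update_weights(x, eps, tau):
    """Weight update w_j = (x_j^2 + eps^2)^(-(2 - tau)/2); requires eps > 0 or all x_j nonzero."""
    return [(xj * xj + eps * eps) ** (-(2.0 - tau) / 2.0) for xj in x]


def smoothed_objective(x, eps, tau):
    """Value of J_tau at the updated weights: sum_j (x_j^2 + eps^2)^(tau/2)."""
    return sum((xj * xj + eps * eps) ** (tau / 2.0) for xj in x)
```

For fixed $z$ and $\epsilon > 0$, each summand of (7.9) is a strictly convex function of $w_j > 0$, so the updated weights are its unique minimizer over $w$; any other positive weight vector gives a strictly larger value of $J_\tau$. For $\tau = 1$ the update reduces to $w_j = (x_j^2 + \epsilon^2)^{-1/2}$.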
Fundamental properties of the algorithm are derived in the same way as before. In particular,
the values $J_\tau(x^n, w^n, \epsilon_n)$ decrease monotonically,

$$J_\tau(x^{n+1}, w^{n+1}, \epsilon_{n+1}) \le J_\tau(x^n, w^n, \epsilon_n), \quad n \ge 0,$$

and the iterates are bounded,

$$\|x^n\|^\tau_{\ell_\tau^N} \le J_\tau(x^1, w^0, \epsilon_0) =: A_0.$$

As in Lemma 4.4, the weights are uniformly bounded from below, i.e.,

$$w_j^n \ge A_0^{-(2-\tau)/\tau}, \quad j = 1, \dots, N.$$

Moreover, using $J_\tau$ for $J$ in Lemma 5.1, we can again prove the asymptotic regularity of the
iterations, i.e.,

$$\lim_{n \to \infty} \|x^{n+1} - x^n\|_{\ell_2^N} = 0.$$
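
To make the iteration concrete, here is a minimal pure-Python sketch of the $\tau$-dependent IRLS loop, written under several assumptions that are not taken from the text: the weighted least squares step is solved through the formula $x = D \Phi^T (\Phi D \Phi^T)^{-1} y$ with $D = \mathrm{diag}(1/w_j)$, using a small Gaussian elimination routine; the initialization $w^0 = (1, \dots, 1)$, $\epsilon_0 = 1$ is assumed; and the stopping threshold `eps_floor` is a numerical safeguard introduced here. All function names are hypothetical.

```python
import random


def gaussian_matrix(m, n, seed=0):
    """m x n matrix with i.i.d. N(0, 1/m) entries (hypothetical test helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) / m ** 0.5 for _ in range(n)] for _ in range(m)]


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    aug = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(m):
        p = max(range(c, m), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, m):
            f = aug[r][c] / aug[c][c]
            for k in range(c, m + 1):
                aug[r][k] -= f * aug[c][k]
    x = [0.0] * m
    for r in reversed(range(m)):
        s = aug[r][m] - sum(aug[r][k] * x[k] for k in range(r + 1, m))
        x[r] = s / aug[r][r]
    return x


def weighted_ls(phi, y, w):
    """Minimize sum_j z_j^2 w_j subject to phi z = y."""
    m, n = len(phi), len(phi[0])
    d = [1.0 / wj for wj in w]
    gram = [[sum(phi[i][j] * d[j] * phi[l][j] for j in range(n)) for l in range(m)]
            for i in range(m)]
    lam = solve(gram, y)
    return [d[j] * sum(phi[i][j] * lam[i] for i in range(m)) for j in range(n)]


def irls(phi, y, K, tau=1.0, iters=30, eps_floor=1e-6):
    """Run the IRLS iteration; returns the last iterate, eps and the recorded J_tau values."""
    n = len(phi[0])
    w, eps = [1.0] * n, 1.0  # assumed initialization
    values = []
    for _ in range(iters):
        x = weighted_ls(phi, y, w)
        r = sorted((abs(v) for v in x), reverse=True)
        eps = min(eps, r[K] / n)  # update rule for eps: r(x)_{K+1} / N
        if eps < eps_floor:  # safeguard (assumption): stop before the system degenerates
            break
        w = [(v * v + eps * eps) ** (-(2.0 - tau) / 2.0) for v in x]
        values.append(sum((v * v + eps * eps) ** (tau / 2.0) for v in x))
    return x, eps, values
```

The list `values` records $J_\tau(x^{n+1}, w^{n+1}, \epsilon_{n+1})$ at each step, so the monotone decrease stated above can be observed directly; every iterate satisfies the constraint $\Phi x = y$ up to rounding.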

The first significant difference with the $\ell_1$-case arises when $\epsilon = \lim_{n \to \infty} \epsilon_n > 0$. In this latter
situation, we need to consider the function

$$f_{\epsilon,\tau}(z) := \sum_{j=1}^N (z_j^2 + \epsilon^2)^{\tau/2}. \tag{7.10}$$

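We denote by $\mathcal{Z}_{\epsilon,\tau}(y)$ its set of minimizers on $\mathcal{F}(y)$ (since $f_{\epsilon,\tau}$ is no longer convex, it may have more
than one minimizer). Even though every minimizer $z \in \mathcal{Z}_{\epsilon,\tau}(y)$ still satisfies

$$\langle z, \eta \rangle_w = 0, \quad \text{for all } \eta \in \mathcal{N},$$

where $w = w^{\epsilon,\tau,z}$ is defined by $w_j^{\epsilon,\tau,z} = \left( (z_j)^2 + \epsilon^2 \right)^{\frac{\tau - 2}{2}}$, $j = 1, \dots, N$, the converse need no longer
be true.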

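The following theorem summarizes the convergence properties of the algorithm in the case
$\tau < 1$.

Theorem 7.7 Fix $y \in \mathbb{R}^m$. Let $K$ (the same index as in the update rule (1.7)) be chosen so that $\Phi$
satisfies the $\tau$-NSP of order $K$ with a constant $\gamma$ such that $\gamma < 1 - \frac{2}{K+2}$. Let $\bar{\mathcal{Z}}_{\epsilon,\tau}(y)$ be the set of
accumulation points of $(x^n)_{n \in \mathbb{N}}$ and define $\epsilon := \lim_{n \to \infty} \epsilon_n$. Then, the algorithm has the following
properties:

(i) If $\epsilon = 0$, then $\bar{\mathcal{Z}}_{\epsilon,\tau}(y)$ consists of a single point $\bar{x}$, the $x^n$ converge to $\bar{x}$, and $\bar{x}$ is an $\ell_\tau$-minimizer
in $\mathcal{F}(y)$ which is also $K$-sparse.

(ii) If $\epsilon > 0$, then for each $\bar{x} \in \bar{\mathcal{Z}}_{\epsilon,\tau}(y)$ we have $\langle \bar{x}, \eta \rangle_{w^{\epsilon,\tau,\bar{x}}} = 0$, for all $\eta \in \mathcal{N}$.

(iii) If $z \in \mathcal{F}(y)$ and $\bar{x} \in \bar{\mathcal{Z}}_{\epsilon,\tau}(y) \cap \mathcal{Z}_{\epsilon,\tau}(y)$, we have

$$\|z - \bar{x}\|^\tau_{\ell_\tau^N} \le C_2\, \sigma_k(z)_{\ell_\tau^N},$$

for all $k < K - \frac{2\gamma}{1 - \gamma}$.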
The proof of this theorem uses Lemmas 7.1-7.6 and follows the same arguments as for Theorem
5.3.

Remark 7.8 Unlike Theorem 5.3, Theorem 7.7 does not ensure that the IRLS algorithm converges
to the sparsest or to the minimal $\ell_\tau$-solution. It does provide conditions that are verifiable a
posteriori (e.g., $\epsilon = \lim_{n \to \infty} \epsilon_n = 0$) for such convergence. The reason for this weaker result is the
non-convexity of $f_{\epsilon,\tau}$. (In particular, it might happen that $x^{\epsilon,\tau}$ is a local minimizer of $f_{\epsilon,\tau}$, but not a
global one, and the estimate in (iii) does not necessarily hold.) Nevertheless, as is often the case
for non-convex problems, we can establish a local convergence result that also highlights the rate
we can expect for such convergence. This is the content of the following section; it will be followed
by numerical results that dovetail nicely with the theoretical results.


7.3  Local  superlinear  convergence 

Throughout this section, we assume that there exists a $k$-sparse vector $x^*$ in $\mathcal{F}(y)$. We define the
error vectors $\eta^n = x^n - x^* \in \mathcal{N}$; we now measure the error by $\|\cdot\|^\tau_{\ell_\tau^N}$:

$$E_n := \|\eta^n\|^\tau_{\ell_\tau^N}.$$

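Theorem 7.9 Assume that $\Phi$ has the $\tau$-NSP of order $K$ with constant $\gamma \in (0, 1)$ and that $\mathcal{F}(y)$
contains a $k$-sparse vector $x^*$ with $k < K$. (Here $K$ is the same as in the definition of $\epsilon_n$ in the
update rule (1.7) in Algorithm 1.) Suppose that, for a given $0 < \rho < 1$, we have

$$E_{n_0} \le R^* := [\rho\, r(x^*)_k]^\tau \tag{7.11}$$

and define

$$\mu := \mu(\rho, K, \gamma, \tau, N) = 2^{1-\tau} \gamma (1 + \gamma) A^\tau \left( 1 + \left( \frac{N^{1-\tau}}{K + 1 - k} \right)^{2-\tau} \right), \qquad A := \left( r(x^*)_k^{1-\tau} (1 - \rho)^{2-\tau} \right)^{-1}.$$

If $\rho$ and $\gamma$ are sufficiently small so that

$$\mu (R^*)^{1-\tau} = \mu\, \rho^{\tau(1-\tau)}\, r(x^*)_k^{\tau(1-\tau)} \le 1, \tag{7.12}$$

then for all $n \ge n_0$ we have

$$E_{n+1} \le \mu E_n^{2-\tau}. \tag{7.13}$$
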


Proof: The proof is by induction on $n$. We assume that $E_n \le R^*$ and derive (7.13). As in
the proof of Theorem 6.1, we let $T$ denote the support of $x^*$, so that $\#(T) = k$ and $r(x^*)_k$ is the
smallest non-zero entry of $x^*$ in absolute value. Following the proof of Theorem 6.1, the first few lines are the same. The first
difference is in the following estimate, which holds for $i \in T$ and replaces (6.3),

$$\frac{|x_i^*|}{\left( (x_i^n)^2 + \epsilon_n^2 \right)^{1 - \tau/2}} \le \frac{|x_i^*|}{|x_i^* + \eta_i^n|^{2-\tau}} \le \frac{|x_i^*|}{\left( |x_i^*| (1 - \rho) \right)^{2-\tau}} = \frac{1}{|x_i^*|^{1-\tau} (1 - \rho)^{2-\tau}} \le A.$$
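Starting with the orthogonality relation (6.2) and using the above inequality and the embedding
$\ell_\tau^N \hookrightarrow \ell_1^N$, we obtain

$$\sum_{i=1}^N |\eta_i^{n+1}|^2 w_i^n \le A \left( \sum_{i \in T} |\eta_i^{n+1}|^\tau \right)^{1/\tau}.$$

We now apply the $\tau$-NSP to find

$$\|\eta^{n+1}\|^{2\tau}_{\ell_2^N(w^n)} = \left( \sum_{i=1}^N |\eta_i^{n+1}|^2 w_i^n \right)^\tau \le \gamma A^\tau \|\eta^{n+1}_{T^c}\|^\tau_{\ell_\tau^N}. \tag{7.14}$$

At the same time, the generalized Hölder inequality (see Proposition 7.4 (iii)) for $p = 2$ and $q = \frac{2\tau}{2-\tau}$,
together with the above estimates, yields

$$\|\eta^{n+1}_{T^c}\|^{2\tau}_{\ell_\tau^N} = \left\| \left( |\eta_i^{n+1}| (w_i^n)^{-1/\tau} \right)_{i \in T^c} \right\|^{2\tau}_{\ell_\tau^N(w^n)} \le \|\eta^{n+1}\|^{2\tau}_{\ell_2^N(w^n)} \left\| \left( (w_i^n)^{-1/\tau} \right)_{i \in T^c} \right\|^{2\tau}_{\ell^N_{2\tau/(2-\tau)}(w^n)} \le \gamma A^\tau \|\eta^{n+1}_{T^c}\|^\tau_{\ell_\tau^N} \left\| \left( (w_i^n)^{-1/\tau} \right)_{i \in T^c} \right\|^{2\tau}_{\ell^N_{2\tau/(2-\tau)}(w^n)}.$$

In other words,

$$\|\eta^{n+1}_{T^c}\|^\tau_{\ell_\tau^N} \le \gamma A^\tau \left\| \left( (w_i^n)^{-1/\tau} \right)_{i \in T^c} \right\|^{2\tau}_{\ell^N_{2\tau/(2-\tau)}(w^n)}. \tag{7.15}$$

Let us now estimate the weight term. By the $\tau$-triangle inequality (7.6) we have

$$\left\| \left( (w_i^n)^{-1/\tau} \right)_{i \in T^c} \right\|^{2\tau}_{\ell^N_{2\tau/(2-\tau)}(w^n)} = \left( \sum_{i \in T^c} \left( |\eta_i^n|^2 + \epsilon_n^2 \right)^{\tau/2} \right)^{2-\tau} \le \left( \sum_{i=1}^N \left( |\eta_i^n|^\tau + \epsilon_n^\tau \right) \right)^{2-\tau} = \left( \sum_{i=1}^N |\eta_i^n|^\tau + N \epsilon_n^\tau \right)^{2-\tau} \le 2^{1-\tau} \left( \|\eta^n\|^{\tau(2-\tau)}_{\ell_\tau^N} + N^{2-\tau} \epsilon_n^{\tau(2-\tau)} \right).$$

Now, an application of Lemma 7.5 gives the following estimates

$$N^{2-\tau} \epsilon_n^{\tau(2-\tau)} = N^{(1-\tau)(2-\tau)} \left( N^\tau \epsilon_n^\tau \right)^{2-\tau} \le N^{(1-\tau)(2-\tau)} \left( r(x^n)^\tau_{K+1} \right)^{2-\tau} \le \left( \frac{N^{1-\tau}}{K + 1 - k} \right)^{2-\tau} \|\eta^n\|^{\tau(2-\tau)}_{\ell_\tau^N}.$$

Using these estimates in (7.15) gives

$$\|\eta^{n+1}_{T^c}\|^\tau_{\ell_\tau^N} \le 2^{1-\tau} \gamma A^\tau \left( 1 + \left( \frac{N^{1-\tau}}{K + 1 - k} \right)^{2-\tau} \right) \|\eta^n\|^{\tau(2-\tau)}_{\ell_\tau^N},$$

and (7.13) follows by a further application of the $\tau$-NSP (see (6.5)).

Because of the assumption (7.12), we also have $E_{n+1} \le R^*$ and so the induction can continue. ■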
Remark 7.10 In contrast to the $\ell_1$ case, we do not need $\mu < 1$ to ensure that $E_n$ decreases. In fact,
all that is needed for the error reduction is $\mu E_n^{1-\tau} < 1$ for some sufficiently large $n$. Indeed, $\mu$ could
be quite large in cases where the smallest non-zero component of the sparse vector is very small.
We have not observed this effect in our examples; we expect that our analysis, although apparently
accurate in describing the rate of convergence (see section 8), is too pessimistic in estimating the
coefficient $\mu$.
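
The role of the constants in Theorem 7.9 and Remark 7.10 can be explored with a short Python sketch. It only iterates the worst-case recursion $E_{n+1} = \mu E_n^{2-\tau}$ with equality, so it illustrates the bound (7.13) rather than the behaviour of actual IRLS iterates; the function names and the sample parameter values are hypothetical.

```python
import math


def radius_and_a(rho, r_k, tau):
    """R* = (rho r_k)^tau and A = (r_k^(1-tau) (1-rho)^(2-tau))^(-1), with r_k = r(x*)_k."""
    r_star = (rho * r_k) ** tau
    a = 1.0 / (r_k ** (1.0 - tau) * (1.0 - rho) ** (2.0 - tau))
    return r_star, a


def mu_coefficient(gamma, tau, n, big_k, k, a):
    """mu = 2^(1-tau) gamma (1+gamma) A^tau (1 + (N^(1-tau) / (K+1-k))^(2-tau))."""
    ratio = n ** (1.0 - tau) / (big_k + 1 - k)
    return 2.0 ** (1.0 - tau) * gamma * (1.0 + gamma) * a ** tau * (1.0 + ratio ** (2.0 - tau))


def worst_case_errors(e0, mu, tau, steps):
    """Iterate E_{n+1} = mu * E_n^(2 - tau), i.e. (7.13) taken with equality."""
    errs = [e0]
    for _ in range(steps):
        errs.append(mu * errs[-1] ** (2.0 - tau))
    return errs


def observed_orders(errs):
    """Estimate the order p in E_{n+1} ~ C E_n^p from consecutive triples of errors."""
    return [math.log(errs[i + 2] / errs[i + 1]) / math.log(errs[i + 1] / errs[i])
            for i in range(len(errs) - 2)]
```

With $\tau = 1/2$, $\mu = 2$ and $E_0 = 0.01$ one has $\mu E_0^{1-\tau} = 0.2 < 1$, so the sequence decreases even though $\mu > 1$, and the estimated order equals $2 - \tau = 1.5$. For $\tau = 1$ the coefficient reduces to $\gamma (1 + \gamma) A \left( 1 + \frac{1}{K + 1 - k} \right)$ with $A = (1 - \rho)^{-1}$ and the recursion is linear.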

8  Numerical  results 

In  this  section  we  present  numerical  experiments  that  illustrate  that  the  bounds  derived  in  the 
theoretical  analysis  do  manifest  themselves  in  practice. 

8.1  Convergence  rates 

We start with numerical results that confirm the linear rate of convergence of our iteratively
re-weighted least squares algorithm for $\ell_1$-minimization, and its robust recovery of sparse vectors. In
the experiments we used a matrix $\Phi$ of dimensions $m \times N$ with i.i.d. Gaussian $\mathcal{N}(0, 1/m)$ entries.
Such matrices are known to possess (with high probability) the RIP with optimal bounds
[2, 4, 35]. In Figure 8.1 we depict the approximation error to the unique sparsest solution shown
in Figure 8.2, and the instantaneous rate of convergence. The numerical results confirm both the
expected linear rate of convergence and the robust reconstruction of the sparse vector.

Next, we compare the linear convergence achieved with $\ell_1$-minimization with the superlinear
convergence obtained by the iteratively re-weighted least squares algorithm promoting
$\ell_\tau$-minimization.

In Figure 8.3 we compare the rate of convergence of our algorithm
for different choices of $0 < \tau \le 1$. For $\tau = 1$, $.8$, $.6$ and $.56$, the figure shows the error
as a function of the iteration step $n$, with $\tau$ held fixed throughout each run.
For $\tau = 1$, the rate is linear, as in Figure 8.1. For the smaller values $\tau = .8$, $.6$ and $.56$ the
iterations initially follow the same linear rate; once they are sufficiently close to the sparse solution,
the convergence rate speeds up dramatically, suggesting we have entered the region of validity of
(7.13). For smaller values of $\tau$ numerical experiments do not always lead to convergence: in some
cases the algorithm never got to the neighborhood of the solution where convergence is ensured.
However, in this case a combination of initial iterations with the $\ell_1$-inspired IRLS (for which we
always have convergence) and later iterations with the $\ell_\tau$-inspired IRLS for smaller $\tau$ again allows for
very fast convergence to the sparsest solution; this is illustrated in Figure 8.3 for the case $\tau = .5$.

8.2  Enhanced  recovery  in  compressed  sensing  and  relationship  with  other  work 

Candès, Wakin, and Boyd [8] showed, by numerical experimentation, that iteratively re-weighted
$\ell_1$-minimization, with weights suggested by an $\ell_0$-minimization goal, can enhance the range of sparsity
for which perfect reconstruction of a sparse vector “works” in compressed sensing. In experiments
with iteratively re-weighted $\ell_2$-minimization algorithms, Chartrand and several collaborators
observed a similar significant improvement [11, 12, 13, 14, 36]; see in particular [13, Section 4]. We
also illustrate this in Figure 8.4. It is to be noted that IRLS algorithms are computationally much
less demanding than weighted $\ell_1$-minimization. In addition, there is, as far as we know, no analysis
(as yet) for re-weighted $\ell_1$-minimization that is comparable to the detailed theoretical analysis of
convergence presented here for our IRLS algorithm, which seems to give a realistic picture of the
numerical computations.

Figure 8.1: An experiment, with a matrix $\Phi$ of size $250 \times 1500$ with Gaussian $\mathcal{N}(0, \frac{1}{250})$ i.i.d. entries,
in which recovery is sought of the 45-sparse vector $x^*$ represented in Figure 8.2 from its image
$y = \Phi x^*$. Left: plot of $\log_{10}(\|x^n - x^*\|_{\ell_1})$ as a function of $n$, where the $x^n$ are generated by Algorithm
1, with $\epsilon_n$ defined adaptively, as in (1.7). Note that the scale in the ordinate axis does not report
the logarithms $0, -1, -2, \dots$, but the corresponding accuracies $10^0, 10^{-1}, 10^{-2}, \dots$ for $\|x^n - x^*\|_{\ell_1}$.
The graph also plots $\epsilon_n$ as a function of $n$. Right: plot of the ratios $\|x^n - x^{n+1}\|_{\ell_1} / \|x^n - x^{n-1}\|_{\ell_1}$
and $(\epsilon_n - \epsilon_{n+1}) / (\epsilon_{n-1} - \epsilon_n)$ for the same examples.

Figure 8.2: The sparse vector used in the example illustrated in Figure 8.1. This vector has
dimension 1500, but only 45 non-zero entries.

9  Acknowledgments 

We would like to thank Yu Chen and Michael Overton for various conversations on the topic of this
paper, and Rachel Ward for pointing out an improvement of Theorem 5.3.

Ingrid Daubechies gratefully acknowledges partial support by NSF grants DMS-0504924 and
DMS-0530865. Ronald DeVore thanks the Courant Institute for supporting an academic year
visit when part of this work was done. He also gratefully acknowledges partial support by Office
of Naval Research Contracts ONR-N00014-03-1-0051, ONR/DEPSCoR N00014-03-1-0675 and
ONR/DEPSCoR N00014-00-1-0470; the Army Research Office Contract DAAD 19-02-1-0028; and
the NSF contracts DMS-0221642 and DMS-0200187. Massimo Fornasier acknowledges the financial
support provided by the European Union via the Individual Marie Curie fellowship
MOIF-CT-2006-039438, and he thanks the Program in Applied and Computational Mathematics at Princeton
University for its hospitality during the preparation of this work. Sinan Güntürk has been supported
in part by the National Science Foundation Grant CCF-0515187, an Alfred P. Sloan Research
Fellowship, and an NYU Goddard Fellowship.




Figure 8.3: We show the decay of the logarithmic error as a function of the number of iterations of the
algorithm, for different values of $\tau$ (1, 0.8, 0.6, 0.56). We also show the results of an experiment in
which the initial 10 iterations are performed with $\tau = 1$ and the remaining iterations with $\tau = 0.5$.